Preface



Welcome to GRID 2000, the first annual IEEE/ACM international workshop on grid
computing sponsored by the IEEE Computer Society’s Task Force on Cluster
Computing (TFCC) and the Association for Computing Machinery (ACM). The
workshop has received generous sponsorship from the European Grid Forum (eGrid),
the EuroTools SIG on Metacomputing, Microsoft Research (USA), Sun Microsystems
(USA), and the Centre for Development of Advanced Computing (India).

It is a sign of the current high levels of interest and activity in Grid computing that 
we have had contributions to the workshop from researchers and developers in 
Australia, Austria, Canada, France, Germany, Greece, India, Italy, Japan, Korea, The
Netherlands, Spain, Switzerland, UK, and USA. It is our pleasure and honor to 
present the first annual international Grid computing meeting program and the 
proceedings. 

The Grid: A New Network Computing Infrastructure 

The growing popularity of the Internet along with the availability of powerful 
computers and high-speed networks as low-cost commodity components are helping 
to change the way we do computing. These new technologies are enabling the 
coupling of a wide variety of geographically distributed resources, such as parallel 
supercomputers, storage systems, data sources, and special devices, that can then be 
used as a unified resource and thus form what are popularly known as “Grids”. The
Grid is analogous to the power (electricity) grid and aims to couple distributed 
resources and offer consistent and inexpensive access to these resources irrespective 
of their physical location. The interest in creating Grids (by pooling resources from 
multiple organizations) is growing due to the potential for solving large-scale 
problems that typically cannot be solved with local resources. Internationally there are 
a large number of projects actively exploring the design and development of different 
Grid system components, services, and applications. Pointers to these projects can be 
found at the following sources: 

• Grid Infoware - http://www.gridcomputing.com 

• IEEE Distributed Systems Online - http://computer.org/channels/ds/gc 

Grids are expected to drive the economy of the 21st century in a
similar fashion to how electrical power grids drove the economy of the 20th century. 

Grid systems need to hide complexities associated with the management and usage 
of resources across multiple administrative institutions. The following are some of the 
key features of Grid infrastructures: 

• Flexibility and extensibility

• Domain autonomy 

• Scalability 

• Global name space 
• Ease of use and transparent access

• Performance 

• Security 

• Management and exploitation of heterogeneous resources 

• Interoperability between systems 

• Resource allocation and co-allocation 

• Fault-tolerance 

• Dynamic adaptability 

• Quality of Service (QoS) 

• Computational Economy 

Grids must be designed and created in such a way that their components (fabric,
middleware, and higher-level tools) and applications handle the key design issues in a
coordinated manner. For instance, Grid middleware offers services for handling
heterogeneity, security, information, allocation, and so on. Higher level tools, such as 
resource brokers, support dynamic adaptability through automatic resource discovery, 
trading for economy of resources, resource acquisition, scheduling, the staging of data 
and programs, initiating computations, and adapting to changes in the Grid status. In
addition, they need to make sure that domain autonomy is honored while still meeting
user requirements such as QoS in coordination with other components. The papers
accepted for inclusion in these proceedings address various issues related to the 
design, development, and implementation of Grid technologies and their applications. 



Program Organization and Acknowledgements 

The response to the workshop’s call for papers has been excellent and we expect that 
attendance at the actual workshop will be equally impressive. The GRID 2000 
program consists of a keynote speech (by Wolfgang Gentzsch on “DOT-COMing the 
GRID: Using Grids for Business”), an invited talk, and refereed technical paper 
presentations. We have accepted papers from authors in fifteen countries, selected from
submissions from eighteen countries. We would like to thank all authors for
submitting their research papers for consideration. We have grouped the contributed 
papers into five distinct categories, although inevitably there is some overlap: 

• Network enabled server systems for the Grid (invited paper) 

• Grid resource management 

• Grid middleware and problem solving environments 

• Grid testbeds and resource discovery 

• Application-level scheduling on the Grid 

The GRID 2000 meeting would not have taken place without the efforts of Viktor
Prasanna, who has been the main driving force behind the international conference on
High Performance Computing (HiPC). It is our pleasure to acknowledge his efforts
and thank him for encouraging us to organize this annual international meeting on Grid
computing. The success of the workshop is wholly due to the hard work of the 
program committee members and external reviewers. They have donated their 
precious time for reviewing and offered their expert comments on the papers. All
submitted papers have been peer reviewed by the technical program committee 
members and external referees. We requested four reviews for each paper and ensured 
that each paper received a minimum of three reviews. All highly recommended and 
promising works have been selected for presentation at the meeting. 

We thank our keynote speaker Wolfgang Gentzsch (Director of Network 
Computing, Sun Microsystems) and invited speaker Satoshi Matsuoka (Tokyo 
Institute of Technology, Japan) for presenting their vision on Grid technologies. 

We owe a debt of gratitude to all our sponsors and contributors. In particular, we 
would like to thank R.K. Arora (C-DAC, Pune), Mohan Ram (C-DAC, Bangalore), and
Wolfgang Gentzsch (Sun Microsystems) for responding to our request for financial 
support enthusiastically and being instrumental in obtaining generous donations from 
their respective organizations. Our special thanks go to Todd Needham (Microsoft 
Research, USA), who has voluntarily come forward to support our Task Force 
activities. We would also like to thank Hilda Rivera (ACM) for handling our request 
for ACM “in-cooperation” status. We thank Jarek Nabrzyski for his help in gathering 
the European Grid forum support for this workshop. Finally, we would like to thank 
the Springer-Verlag team, particularly Jan van Leeuwen (LNCS series editor), Alfred
Hofmann (Executive Editor), Antje Endemann, and Karin Henzold. They are 
wonderful to work with! 

We hope these proceedings serve as a useful reference on Grid computing. We 
wish you all the best and hope you enjoy your visit to the Silicon Valley of India! 
DOT-COMing the GRID: Using Grids for Business



Wolfgang Gentzsch 

Sun Microsystems Inc, Palo Alto, California, USA 

Abstract: In this presentation, a short outline of the history of past and present Grid 
projects in research and industry is given, followed by some near- and long-term Grid 
scenarios and visions on how data and compute Grids will complement current Internet 
services and thus change our working and living environments and habits. In essence, 
implementation and professional exploitation of the complex and highly sophisticated 
Grid technologies will still take a couple of years and give us time enough to adapt to 
the dramatic changes and potential opportunities Grids will create in the future. 



1. Grids in Research 

In the early Nineties, research groups all over the world started exploiting distributed 
computing resources over the Internet: scientists collected and utilized hundreds of 
workstations for highly parallel applications like molecular design and computer 
graphics rendering. Other research teams glued large supercomputers together into a 
virtual metacomputer, distributing subsets of a meta-application (e.g. the computer 
simulation of multi-physics applications) to specific vector, parallel and graphics 
computers, over wide-area networks. 

The scope of many of these research projects was to understand and demonstrate 
the actual potential of the networking, computing and software infrastructure and to 
develop it further. This led us to Internet infrastructure projects like Globus and 
Legion, which enable users to combine nearly any set of distributed resources into 
one integrated metacomputing workbench to allow users to measure nature (e.g. with 
microscope or telescope), process the data according to some fundamental 
mathematical equations (e.g. the Navier-Stokes equations), and provide computer 
simulations and animations to study and understand these complex phenomena. 

These projects created a new era in distributed computing, according to the book 
The Grid: Blueprint for a New Computing Infrastructure. Generally speaking, a 
computational Grid is a hardware and software infrastructure that provides 
dependable, consistent, pervasive, and inexpensive access to computational 
capabilities. These Grids, in the near future, will be used by computational engineers 
and scientists, associations, corporations, environment, training and education, states, 
consumers, etc. They will be dedicated to on-demand computing, high-throughput 
computing, data-intensive computing, collaborative computing, and supercomputing, 
potentially on an economic basis. Grid communities, among others, are national Grids 
(like ASCI), virtual grids (e.g. for research teams), private grids (e.g. a BMW 
CrashNet for the car manufacturer BMW and its suppliers, for collaborative crash 
simulations), and public grids (e.g. consumer networks). 



R. Buyya and M. Baker (Eds.): GRID 2000, LNCS 1971, pp. 1-3, 2000.
Today, we see the first attempts to more systematically exploit Grid computing
resources over the Internet. Distributed computing projects like SETI@home, 
Distributed.Net, and Folderol, let Internet users download scientific data, run it on 
their own computers using spare processing cycles, and send the results back to a 
central database. Recently, the Compute Power Market project has been initiated to
develop software technologies that enable creating Grids where anyone can sell idle 
CPU cycles, or those in need can buy compute power much like electricity today. 

2. Grids in Industry 

Encouraged by the response of thousands of participants in these research initiatives,
new Internet startup companies like Popular Power, Entropia, Distributed Science,
and United Devices are trying to turn this idea into a real business that resells untapped
resources for a profit, hoping that computer users will be interested in donating their
extra computing power to projects that crunch a lot of data, such as the search for a 
new cancer drug or patterns in the human genome. Other potential candidate 
applications are complex financial analysis and generation of intensive graphics. 

While this kind of global (and ’wild’) Internet computing will probably be 
successful in the future where privacy and security are only minor issues (i.e., mostly 
in research-oriented projects), global industries might have some real concerns in 
using this Internet computing technology for their strategic businesses. Besides
security of information and data, these companies need guarantees for the availability
and utilization of dedicated resources, high quality of service, easy, fast and
authenticated computing portal access to hardware and software, and tools for 
accounting, reporting, monitoring, and planning. 

Just recently, industry started to experiment with more commercially oriented e- 
business models for high-performance and data-intensive computing via the Internet. 
For example, debis Systemhaus, a DaimlerChrysler company in Germany, offers its
NEC SX-5 supercomputer power through an Internet e-commerce gateway using a
public web server, a secure web server and a discussion server. The web pages are
based on JAVA applets, CGI scripting and JAVA servlets. In addition, an LDAP
customer database is used for the management of security and encryption certificates.
A user can register using HTML forms; the secure Web site requests certificates to
identify the user; a Hummingbird UNIX desktop, started from the browser, redirects the
application to the customer’s desktop; and Pegasus (the application-dependent job
submission GUI) submits the job to the batch system.

3. Grid Resource Management 

Most of the underlying sophisticated technologies are currently under development. 
Large research communities like the GridForum and EGrid are coordinating all kinds
of Grid research; prototype Grid environments such as the public-domain Globus and
Legion exist; research in resource management is underway in projects like EcoGrid; and
the basic building block for a commercial Grid resource manager exists in Sun’s
Grid Engine software. Grid Engine is a new-generation distributed resource
management software that dynamically matches users’ hardware and software
requirements to the available heterogeneous resources in the network according to
predefined policies usually prescribed by the management in the enterprise.

The Grid Engine acts much like our body’s central nervous system (sometimes
called ’The Body’s Internet’). The Grid Engine Master (’the brain’) with its sensors in
every computer (comparable to the sensations of touch, sound, smell, taste, and sight)
dynamically acts and reacts, according to set policies (comparable to move, eat, drink,
sleep) to allow for full control and achieve optimum utilization and efficiency. Grid
Engine has been developed as an enhancement of Codine from the former Gridware Inc.,
according to well defined requirements from the Army Research Lab in Aberdeen, 
and BMW in Munich, where today Grid Engine manages over 800 powerful compute 
servers in each of these local Grids. Average usage increased from well under 50% to 
over 90%, in both environments. 

4. Future Grid Economies 

The next step is to enhance Grid Engine, which currently is restricted to managing local
computer resources, towards ’The GRID Broker’, which will be able to match the
user’s compute jobs with the available resources in the network, including invoicing
users for the CPU power they consume, very much like today’s electric power
consumption, telephone usage or water supply. The Grid broker will match the user’s
requirements to the best fitting Application Service Provider (ASP) in the universe 
which optimally fulfills the user’s hardware, software and service needs. 

This GRID Broker belongs to the enabling technologies of the next Internet Age. 
The Internet, for a long time, has been used only for information. Only recently,
enabled by several important improvements in hardware infrastructure, security,
authentication, and ease of access, has it been used for electronic commerce. And just now,
the next revolutionary step complementing the Internet can be foreseen: The Grid 
Computing Infrastructure, i.e. all kinds of dedicated GRIDs used for collaboration and 
collaborative computing in industry and research, for application simulation and 
animation, for real-time video, on-demand virtual reality presentations, and other 
services for consumers and producers. 

This high-quality and economically oriented usage of the Internet will be enabled 
by several new technologies and achievements made recently. For example, CORBA
offers a standard interface definition to interconnect any distributed object in the
world. JAVA provides a common platform for distributed resources and thus
guarantees full cross-platform portability, and JINI allows any electronic device to be
interconnected in a scalable way. And the chaos which potentially can arise with
this wealth of interconnected devices, clusters, subgrids, and grids, will be removed 
and brought into a well-organized and well-functioning ’organism’, by the GRID 
Resource Broker, supported by intelligent agents which, through the network or 
wireless, report to the Central Grid Engine the details on available resources, and the 
consumers’ habits and needs for specific resources in the GRID. 

Then, eventually, in a next (and final?) step, the central Grid Engine will disappear, 
partly as an integrated component of the local operating systems, and partly being 
replaced by intelligent mobile agents, which enable a universal and self-healing 
environment with potentially infinite compute power available on-demand, and as 
easily accessible as today’s electricity, telephony, roads and water infrastructures.
For URLs to projects referenced in the paper, see: gridcomputing.com.
Abstract. The Network Enabled Server is considered to be a good candidate
for viable Grid middleware, offering an easy-to-use programming
model. This paper clarifies design issues of Network Enabled Server systems
and discusses possible choices and their implications, namely those
concerning connection methodology, protocol command representation,
security methods, etc. Based on these issues, we have designed and implemented
the new Ninf system v.2.0. For each design decision we describe
the rationale and the details of the implementation as dictated by the
choices. We hope that the paper serves as a design guideline for future
NES systems for the Grid.



1 Introduction 

A Network Enabled Server (NES) system is an RPC-style Grid system where
a client requests execution of a task from a server. There are several systems
that adopt this as the basic model of computation, such as our Ninf system [1],
Netsolve [2], Nimrod [3], Punch [4], and Grid efforts utilizing CORBA [5,6,7].

NES systems provide an easy-to-use, intuitive, and somewhat restricted user
and programming interface. This allows potential users of Grid systems to
easily make their applications “Grid enabled”, lowering the threshold of acceptance.
Thus, we deem it one of the important abstractions to be layered on
top of lower-level Grid services such as Globus [8] or Legion [9].

Since 1995, we have been conducting the Ninf project, whose goal has been
to construct a powerful and flexible NES system [10,11], and have investigated
the utility of such systems through various application and performance
experiments [12]. There, we have gained valuable experience on the necessary
technical aspects of NES systems which distinguish them from conventional
RPC systems such as CORBA, as well as various tradeoffs involved in the design
of such systems [13]. Based on such observations, we have redesigned and
reimplemented version 2.0 of the Ninf system from scratch.
Fig. 1. General Architecture of NES Systems



The purpose of this paper is to discuss the notable technical points which led
to the design decisions made for Ninf v.2.0. In particular, for the latter half of
the paper we focus on security, which is mostly lacking in the current
generation of NES systems.



2 General Overview of NES Systems 

In general, NES systems consist of the following components (Figure 1):

— Clients: Request execution of grid-enabled libraries and/or applications from
the server.

— Servers: Receive requests from clients, and execute the grid-enabled libraries
and/or applications on the clients’ behalf.

— Scheduler: Selects amongst multiple servers for execution according to the
information obtained from the resource database.

— Monitors: Monitor the status of various resources, such as computing resources,
communication resources, etc., and register the results in the resource
database.

— Resource Database: Stores and maintains the status of monitored resources.

The Monitors periodically “monitor” the status of resources such as the server,
network, etc., and register the results in the Resource Database. Users
of Grid systems modify their applications to utilize the servers with the use of
client APIs, or tools that have been constructed using the client APIs. The Client
queries the Scheduler for an appropriate Server. The Scheduler, in turn,
acquires the information on computing resources, selects the appropriate server
according to some scheduling algorithm, and returns the selection to the Client.
The Client then remotely invokes the library/application on the selected server
by sending the appropriate argument data. The server performs the computation
and returns the result data to the client.
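The request flow among Monitors, Resource Database, Scheduler, Client, and Server can be sketched in a few lines of Python. This is a minimal in-process illustration and not the Ninf implementation: the names `ResourceDatabase`, `Scheduler`, and `call`, as well as the least-loaded selection policy, are assumptions made for the example, since the text only requires "some scheduling algorithm".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ResourceDatabase:
    """Holds the latest status reported by the Monitors (hypothetical)."""
    cpu_load: Dict[str, float] = field(default_factory=dict)

    def register(self, server: str, load: float) -> None:
        self.cpu_load[server] = load


class Scheduler:
    """Selects a server from the database (assumed policy: least loaded)."""

    def __init__(self, database: ResourceDatabase) -> None:
        self.database = database

    def select(self) -> str:
        if not self.database.cpu_load:
            raise LookupError("no servers registered")
        return min(self.database.cpu_load, key=self.database.cpu_load.get)


def call(scheduler: Scheduler,
         servers: Dict[str, Callable[[List[float]], float]],
         args: List[float]) -> Tuple[str, float]:
    """Client side: ask the Scheduler for a server, then invoke the library there."""
    chosen = scheduler.select()
    return chosen, servers[chosen](args)


# Monitors register the observed load; the Client then issues one remote call.
db = ResourceDatabase()
db.register("server-a", 0.9)
db.register("server-b", 0.2)
libraries = {"server-a": sum, "server-b": sum}  # stand-ins for remote libraries
print(call(Scheduler(db), libraries, [1.0, 2.0, 3.0]))
```

In a real NES system each arrow in this flow is a network exchange; the sketch only fixes the order of responsibilities.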
2.1 Design Issues in NES Systems

There are several design issues regarding the construction of NES systems, including
the connection methods of client and servers, communication protocols,
and security. Moreover, there is an issue of how we make the system open to
future extensions.



Client-Server Connection Methodologies. The client must first establish a
connection with the selected server. The sub-issues involve 1) continuous connection
versus connection-by-necessity, and 2) usage of proxies.

Continuous Connection versus Connection-by-Necessity: Continuous connection
maintains the connection between the server and the client while the server is
performing the computation. In contrast, Connection-by-Necessity makes fine-grain
connections/disconnections between the client and the server on demand.

Continuous connection is typically employed for standard RPC implementations;
it is easy to implement under the current TCP/IP socket APIs, and
furthermore, allows easy detection of server faults via stream disconnection. The
drawback is the restriction on how many parallel tasks can be invoked by a
client. Since the connection to the server must be maintained, the client process
requires more file descriptors than the number of parallel tasks being invoked.
However, since the number of file descriptors per process is restricted in most
OSes, this limits the number of parallel tasks. This has not been a problem for
traditional RPCs, since most transactions are short-lived, and/or the number of
connections is small since the user tasks are sequential.

Moreover, continuous connection requires the client to constantly be on-line,
without any interruption in the communication. Thus, the client cannot go off-line,
either deliberately or by accident; even a momentary failure in the communication
will cause a fault. This again is a restriction, since some Grid-enabled
libraries may take hours or even days to compute.

By contrast, in Ninf v.2.0 we have adopted Connection-by-Necessity. Basically,
when the client makes an RPC request to the server, it disconnects once
the necessary argument data has been sent. Once the server finishes the computation,
it establishes a new connection with the client and sends back the
result. This overcomes the restriction of the Continuous Connection, but a) the
protocol becomes more complex, due to the requirement of server-initiated and
secure connection re-establishment, b) there needs to be an alternative method of
detecting server faults, and c) performance may suffer due to connection costs.
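The effect of the two connection strategies on the number of simultaneously open connections (and hence file descriptors) can be illustrated with a small timing model. The sketch below is an assumption-laden illustration rather than a measurement: each task is a hypothetical tuple `(start, send, compute, receive)` of time units, and the helper names are invented for the example.

```python
from typing import List, Tuple

Task = Tuple[int, int, int, int]      # (start, send, compute, receive), hypothetical
Interval = Tuple[int, int]            # (open time, close time) of one connection


def continuous(tasks: List[Task]) -> List[Interval]:
    """One connection per task, held from the request until the result arrives."""
    return [(t0, t0 + send + compute + recv) for t0, send, compute, recv in tasks]


def by_necessity(tasks: List[Task]) -> List[Interval]:
    """Two short connections per task: one to send arguments, one for the result."""
    intervals = []
    for t0, send, compute, recv in tasks:
        intervals.append((t0, t0 + send))
        done = t0 + send + compute
        intervals.append((done, done + recv))
    return intervals


def peak_open(intervals: List[Interval]) -> int:
    """Largest number of connections open at the same time."""
    events = []
    for opened, closed in intervals:
        events.append((opened, 1))
        events.append((closed, -1))
    events.sort()                     # at equal times, closes (-1) precede opens (+1)
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# 100 long-running tasks issued one time unit apart.
tasks = [(i, 1, 1000, 1) for i in range(100)]
print(peak_open(continuous(tasks)), peak_open(by_necessity(tasks)))  # prints: 100 1
```

Under these assumed timings the continuous strategy needs one descriptor per outstanding task, whereas connection-by-necessity needs descriptors only while data is actually moving. The model ignores the listening socket that the client needs in order to accept the server-initiated callback.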

Direct Connection versus Proxy-based Connection. Another concern is whether
to connect the client and the server directly, or to assume dedicated, mediating
proxies for various purposes including connection maintenance, performance
monitoring, and firewall circumvention.

The “old” Ninf system (up to v.1.2) employed proxy-mediated connection,
for the purpose of simplifying the client libraries. All traffic was mediated by
the proxy; in fact, communication with the Scheduler for server selection was
performed by the proxy and not by the client itself. On the other hand, routing
the communication through the proxy results in performance overhead, which
is of particular concern for Grid systems since communication of large amounts of
data is typical.

Communication Protocol Commands for Grid RPC. Communication 
Protocol Commands, or simply Protocol Commands are a set of commands that 
are used to govern the communication protocol between the client and the server. 
They can be largely categorized into binary formats and text-based formats. 

Binary formats allow easy and lightweight parsing of command sequences,
but are difficult to structure, debug and extend. In contrast, well-designed
text-based formats are well-structured, easy to understand and extend, but are
less efficient and require more software effort to parse.

Although traditionally text-based commands for communication protocols
were typically simple, involving little structure such as S-expressions, there is
a recent trend to employ XML for this purpose. Although XML requires more
effort on the software side for parsing etc., we can assign a schema in a standard
way using a DTD. Since command overhead can be amortized over relatively large
data transfers, we believe XML is a viable option given its proliferation as well
as the availability of standard tools.
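The tradeoff between the two command representations can be made concrete with a small sketch. The command layout below is purely hypothetical: the opcode value, the `call` and `arg` element names, and the attribute names are assumptions for illustration and are not the Ninf DTD.

```python
import struct
import xml.etree.ElementTree as ET

OP_CALL = 1  # hypothetical opcode


def encode_binary(call_id: int, n: int) -> bytes:
    """Fixed layout: 16-bit opcode, 32-bit call id, 32-bit argument (network order)."""
    return struct.pack("!HII", OP_CALL, call_id, n)


def decode_binary(payload: bytes) -> dict:
    opcode, call_id, n = struct.unpack("!HII", payload)
    return {"opcode": opcode, "id": call_id, "n": n}


def encode_xml(function: str, call_id: int, args: dict) -> bytes:
    """Self-describing layout: new arguments can be added without breaking parsers."""
    root = ET.Element("call", {"function": function, "id": str(call_id)})
    for name, value in args.items():
        arg = ET.SubElement(root, "arg", {"name": name, "type": type(value).__name__})
        arg.text = str(value)
    return ET.tostring(root)


def decode_xml(payload: bytes) -> dict:
    casts = {"int": int, "float": float, "str": str}
    root = ET.fromstring(payload)
    args = {a.get("name"): casts[a.get("type")](a.text) for a in root.findall("arg")}
    return {"function": root.get("function"), "id": int(root.get("id")), "args": args}


binary = encode_binary(7, 100)
text = encode_xml("dgemm", 7, {"n": 100, "alpha": 1.5})
print(len(binary), len(text))  # the XML command is several times larger
print(decode_xml(text))
```

The binary command is compact but its meaning lives entirely in the `"!HII"` format string shared by both sides; the XML command costs more bytes and parsing effort, which is negligible next to the bulk argument data that follows it.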

Security Mechanism. Security is by all means an important part of any Grid
system. However, there are several options for security, depending on the operating
environment of the system.

If the operating environment is totally local within some administrative domain,
where all the participants can be trusted, we can simply do away with
security. In a slightly more wide-area and well-administered environment, such
as within a University campus, it suffices to restrict access based on, say, client
IP address. On the other hand, if global usage is assumed, then by all means we
must guard against malicious users, and thus require authentication based on
encryption. Examples are Kerberos, which employs symmetric key technology,
and SSL, which utilizes public key algorithms.
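The intermediate option, restricting access by client IP address, can be sketched with Python's standard `ipaddress` module. The network ranges and the function name `is_allowed` are hypothetical; the addresses are reserved documentation blocks, not real campus networks.

```python
import ipaddress

# Hypothetical campus ranges (reserved documentation address blocks).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_allowed(client_ip: str) -> bool:
    """Accept a request only if the client address lies in an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed addresses are rejected outright
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("192.0.2.17"), is_allowed("203.0.113.5"))
```

Such a check offers no protection against address spoofing, which is why the text requires encryption-based authentication once global usage is assumed.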

System Openness and Interoperability with Other Grid Systems. One
important design choice is how much we make the system open to customization,
especially with respect to other, more general Grid software infrastructure,
and/or Grid components with some specific function. More concretely, Grid toolkits
such as Globus provide a low-level communication layer, security layer, directory
service, heartbeat monitoring, etc. Components such as NWS (Network
Weather Service [14]) provide stable monitoring and prediction services for measuring
resources on the Grid, such as node CPU load and network communication
performance. Conventional components which were not initially intended
as Grid services could be incorporated as well, such as LDAP, which provides
a standard directory service API; Globus employs LDAP directly with its
MDS (Metacomputing Directory Service), providing a Grid directory service.
By using such existing subsystems and components, we can directly utilize
functionalities which have been tried and tested, and which are also subject to independent
improvement. On the other hand, because such subsystems are designed for
generality, they have a larger footprint, and could be tougher to manage. Moreover,
the supported platforms would be the intersection of the platforms supported by
the individual subsystems.

3 Design and Implementation of the New Ninf System 

3.1 Conceptual Design Decision Overview 

We designed and implemented a new version of the Ninf system (Ninf version
2.0) with the abovementioned design issues in mind. The new system is designed
to be flexible and extensible, with interoperability with existing Internet and
Grid subsystems in mind. Because NES systems typically involve tasks where
computation is dominant, we made design decisions that gave precedence to interoperability
and flexibility over possible communication overhead if such could
be amortized.



Client-Server Connections. In order to accommodate multiple, fault-tolerant,
long-running calls in Grid Environments, we opted for connection-by-necessity
over continuous connections. We have also decided to employ proxy-based
connections in order to simplify client structure. However, in order to
avoid bandwidth bottlenecks, proxies only intervene in command negotiations
between the client and server; when the actual arguments of the remote call
are being transferred, the client and the server communicate directly, unless a
firewall must be crossed.



Communication Protocol Commands. For flexibility, extensibility, and interoperability,
we decided to adopt XML-based text commands.
In the latter sections we present an overview of the DTD schema for numerical
RPCs. Free parsers for C and Java are available, which simplified our implementation.



Security Mechanism. To allow Ninf to be used in a global Grid environment, 
we opted to construct a Globus-like, SSL-based authentication and authorization 
layer, which allows delegation of authentication along a security chain. Kerberos 
was an obvious alternative, but SSL was becoming a commercial standard, and 
multiple free library implementations in C and Java are available.



System Openness and Interoperability with Other Grid Systems. This
was the most difficult decision, since advantages and disadvantages of employing
existing Grid components could be strongly argued both ways. As a compromise,
we have decided to provide default implementations of all our basic submodules;
however, we have designed them to have well-defined interfaces, to be pluggable
with existing modules in operating environments where such services are already
available. For example, although the default implementation of the resource database
lookup service has its own LDAP lookup feature, it could also directly
utilize Globus MDS services where they are available.

Fig. 2. Overview of the New Ninf System



3.2 Overview of the New Ninf System v.2.0

The new Ninf system v.2.0 is composed of the following subsystems (Figure 2):



— Client 

A user-side component which requests (parts of) computing to be done on
remote servers in the Grid. The client is “thin” in the sense that as little
information as possible is retained on the client side; for example, the IDL of the
remote call is not maintained by the client, but rather automatically shipped
on demand from the server.

— Server 

Receives remote compute requests from the clients and invokes the appropriate
executable. The server might act as a backend for invoking parallelized
libraries on multiple compute nodes, such as a library written in
C/Fortran+MPI served by a Cluster.

— Proxy 

Communicates with a Scheduler on behalf of the Client, decides on
which server to invoke the remote computation, and forwards the request to
that server. (The behavior of the proxy is similar to Netsolve Agents in this
case.)
— Executables

Components which actually embed each remote application or library to
be invoked. They are invoked by the server, and communicate with the client
to perform the actual computations.

— Data Storage 

Temporary storage on the Grid to store intermediate results amongst multiple
servers.

— Scheduler 

The scheduler receives requests from the proxy, and selects an appropriate 
server under some scheduling algorithm. The scheduler communicates with 
the database server in order to drive the scheduling algorithm. 

— Database Manager 

Manages the information stored in the Grid resources database. The 
database itself utilizes an existing distributed resource database for the Internet 
and/or the Grid (e.g., LDAP or Globus MDS, which in turn uses LDAP 
itself); the resource lookup request from the client is delegated through the 
manager. This naturally allows other database infrastructure to be utilized. 

— Network Monitor/Server Monitor 

Monitors the status of the network, servers, and other resources. The result 
is reported periodically and automatically to the database manager. 

— Fault Manager 

Performs recovery actions when a fault or error that affects the system in 
a global way is detected. For example, if a server is found to be down (using 
heartbeat monitoring), the server is deleted from the resource database. 
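
The Fault Manager example above (removing a server that stops sending heartbeats) can be sketched as follows. This is only a minimal illustration under stated assumptions: the `ResourceDatabase` class, the `purge_dead_servers` function, and the timeout value are hypothetical names chosen for the sketch and are not part of the Ninf implementation, which keeps this information in LDAP or Globus MDS.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceDatabase:
    """Hypothetical in-memory stand-in for the Grid resource database."""
    last_heartbeat: dict[str, float] = field(default_factory=dict)

    def report(self, server: str, now: float) -> None:
        # Called by the server monitor each time a heartbeat arrives.
        self.last_heartbeat[server] = now

    def remove(self, server: str) -> None:
        self.last_heartbeat.pop(server, None)


def purge_dead_servers(db: ResourceDatabase, now: float, timeout: float) -> list[str]:
    """Delete every server whose last heartbeat is older than `timeout` seconds."""
    dead = sorted(s for s, t in db.last_heartbeat.items() if now - t > timeout)
    for server in dead:
        db.remove(server)
    return dead


db = ResourceDatabase()
db.report("serverA", now=0.0)    # last heard from 30 s ago: considered down
db.report("serverB", now=25.0)   # last heard from 5 s ago: still alive
removed = purge_dead_servers(db, now=30.0, timeout=10.0)
print(removed)
```

Passing the clock value in as `now` keeps the sketch deterministic; a real monitor would presumably read the system clock instead.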



3.3 Client-Server Communication in the Ninf System 2.0 

The new Ninf system manages the client-server communication in the following 
manner (Figure 3): 

The client first requests the interface information of the executable to be 
invoked from the proxy. It then requests the invocation of the executable. The 
client immediately closes its connection with the proxy, and enters a state 
in which it waits for a callback from the proxy. Because no connections remain 
pending, the client can then proceed to issue hundreds of simultaneous requests. 

The proxy in turn asks the Scheduler to select an appropriate server 
(or a set of servers) to perform the invocation. The scheduler queries the 
database manager for server and network throughput information, 
as well as other resource information such as the location of files used in the 
computation. The scheduling algorithm selects an appropriate server (or set of 
servers) and returns this information to the proxy. The algorithm itself is pluggable; one 
can employ a simple algorithm such as that employed by NetSolve (sorting by server 
load), or a more sophisticated algorithm such as those employed by Nimrod. 
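
One way to picture the pluggable scheduling algorithm is as a function that the Scheduler receives at construction time. The sketch below rests on assumptions: `ServerInfo`, `Scheduler`, and both policies are hypothetical names, and the load-based policy only mirrors the "sorting by server load" idea mentioned above rather than reproducing NetSolve's actual algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ServerInfo:
    name: str
    load: float         # assumed: load figure reported by the server monitor
    throughput: float   # assumed: measured network throughput to the client


# A scheduling algorithm is any callable that picks one server from the candidates.
SchedulingAlgorithm = Callable[[Sequence[ServerInfo]], ServerInfo]


def least_loaded(servers: Sequence[ServerInfo]) -> ServerInfo:
    """Simple policy: choose the server with the smallest load."""
    return min(servers, key=lambda s: s.load)


def best_throughput(servers: Sequence[ServerInfo]) -> ServerInfo:
    """Alternative policy: choose the server with the fastest network path."""
    return max(servers, key=lambda s: s.throughput)


class Scheduler:
    def __init__(self, algorithm: SchedulingAlgorithm) -> None:
        self.algorithm = algorithm  # swapped without touching the rest of the system

    def select(self, servers: Sequence[ServerInfo]) -> ServerInfo:
        if not servers:
            raise ValueError("no candidate servers in the resource database")
        return self.algorithm(servers)


candidates = [
    ServerInfo("a", load=2.5, throughput=10.0),
    ServerInfo("b", load=0.3, throughput=4.0),
    ServerInfo("c", load=1.1, throughput=40.0),
]
by_load = Scheduler(least_loaded).select(candidates)
by_net = Scheduler(best_throughput).select(candidates)
```

Because the proxy only sees the selected server, replacing `least_loaded` with a more elaborate policy would not change the client or the proxy.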

The proxy forwards the invocation request to the selected server. The server 
in turn invokes the executable to perform the actual computation. The 
executable then requests the necessary arguments from the client by sending the 
appropriate IDL program for marshalling. When all the arguments have been received, 
the executable notifies the client, closes the connection, and proceeds to 
compute the request. The client again enters a state in which it waits for a callback from 
the executable on completion of the invocation. 

Fig. 3. Invocation Protocol 

When the computation is finished, the executable reconnects with the client, 
and transmits the result, indicating termination of the invocation. The client 
acknowledges the receipt with the termination command. 

Finally, the executable notifies the proxy that the invocation has terminated. 
The proxy in turn forwards this to the client. The proxy notifies the Database 
Manager of the termination, allowing it to update the resource database. 



3.4 Communication Protocol Commands in New Ninf 

As an example of the communication command protocol, Figure 4 shows the DTD 
of the protocol command for specifying and invoking a remote executable on a 
server. A sample invocation command in XML based on this DTD is shown in 
Figure 5. 

One may notice that the invocation command embodies two addresses, client 
and observer; here, client is the address used for client callbacks, whereas 
observer is the address used to notify the proxy of termination of the invocation. 
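
As an illustration, a command with the element and attribute names of Figures 4 and 5 could be assembled with Python's standard `xml.etree.ElementTree` module. This is a sketch only: the helper `build_invoke_command` is a hypothetical name, and ElementTree does not validate the result against the DTD.

```python
import xml.etree.ElementTree as ET


def build_invoke_command(process: str, host: str, port: int, session_key: str,
                         module: str, entry: str,
                         client_addr: tuple[str, int],
                         observer_addr: tuple[str, int]) -> str:
    """Serialize an invoke_executable command; child order follows the DTD."""
    root = ET.Element("invoke_executable")
    ET.SubElement(root, "issuer", process=process, host=host,
                  port=str(port), session_key=session_key)
    ET.SubElement(root, "function_name", module=module, entry=entry)
    # client: address for client callbacks; observer: address notified on termination.
    for tag, (peer_host, peer_port) in (("client", client_addr),
                                        ("observer", observer_addr)):
        parent = ET.SubElement(root, tag)
        ET.SubElement(parent, "peer", host=peer_host, port=str(peer_port))
    return ET.tostring(root, encoding="unicode")


command = build_invoke_command(
    process="nsserver", host="hpc.etl.go.jp", port=30000, session_key="12345",
    module="test", entry="mmul",
    client_addr=("hpc.etl.go.jp", 30000),
    observer_addr=("hpc.etl.go.jp", 30001),
)
parsed = ET.fromstring(command)  # round-trip to confirm the command is well formed
```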



3.5 Security Layer in the New Ninf System 

Security in a NES system involves authentication, authorization, and privacy. 
Authentication identifies who is connecting to the server; authorization determines 
which resources are permitted to the user that has been identified; and privacy keeps 
communication and computation private from other users connecting to the NES 
system. 

<!ELEMENT invoke_executable 
    (issuer, function_name, client, observer)> 
<!ELEMENT issuer EMPTY> 
<!ATTLIST issuer process CDATA #REQUIRED> 
<!ATTLIST issuer host CDATA #REQUIRED> 
<!ATTLIST issuer port CDATA #REQUIRED> 
<!ATTLIST issuer session_key CDATA #REQUIRED> 
<!ELEMENT function_name EMPTY> 
<!ATTLIST function_name module CDATA #REQUIRED> 
<!ATTLIST function_name entry CDATA #REQUIRED> 
<!ELEMENT client (peer)> 
<!ELEMENT observer (peer)> 
<!ELEMENT peer EMPTY> 
<!ATTLIST peer host CDATA #REQUIRED> 
<!ATTLIST peer port CDATA #REQUIRED> 

Fig. 4. Remote Executable Command DTD 

<invoke_executable> 
  <issuer process="nsserver" 
          host="hpc.etl.go.jp" 
          port="30000" session_key="12345" /> 
  <function_name module="test" entry="mmul" /> 
  <client> 
    <peer host="hpc.etl.go.jp" port="30000" /> 
  </client> 
  <observer> 
    <peer host="hpc.etl.go.jp" port="30001" /> 
  </observer> 
</invoke_executable> 

Fig. 5. Example Invocation Command 

The new Ninf system has the client connect to the server via a shared proxy; 
however, server authentication and authorization must be performed with the client 
identity, given the (rather remote but still existing) possibility that the proxy may be 
spoofed. Another situation is when server A acts as a client and delegates part of 
its work to server B on another machine. There, not only does server A need to 
be authenticated, but the client identity must also be authenticated and authorized 
at server B. We deem such “delegation of identity” an essential part of a 
NES system. 

The new Ninf system implements the NAA (NES Authentication 
Authorization) module. NAA employs SSL as the underlying encryption mechanism and 
implements delegation of identity and authorization on top of it. Delegation 
of identity is done automatically by the NAA, and the client user merely needs to 
specify his certificate as is done with SSL. NAA itself is relatively self-contained, 
and thus could be used by other NES systems such as NetSolve. 

Delegation of Identity. Identity in SSL consists of a certificate certified by a 
CA (Certificate Authority). CAs can be made hierarchical: it is possible to sign 
a certificate using another (signed) certificate. In NAA, we have implemented 
delegation of identity by not tying user identity directly to his certificate 
alone, but rather broadening the ‘identity’ to include all the certificates 
signed using the user’s certificate. 

SSL employs public key encryption, where a certificate consists 
of the user’s public key encrypted with the CA’s private key. We can form a 
so-called certificate chain by generating another key pair and encrypting its 
public key with the user’s private key. On authentication, the CA’s public key is used to decrypt 
the certificate, which reveals the public key of the user. This key can be used for 
identification (by decrypting data which had been encrypted with the private 
key of the user) or, for a chain, to obtain the public key 
of the next element in the chain. NAA uses such a certificate chain for 
authentication, in that if the user’s certificate appears somewhere in the chain, it is 
regarded as providing the user’s identity. The security layer of Globus employs 
a similar strategy [15]. 

As an example, let us consider when the client calls server A, which in turn 
calls server B as a client. When server A receives a connection request from 
the client, it generates a new key pair, and sends back its public key to the 
client, which is asked to create a session certificate embodying its identity. The 
client generates a session certificate by signing (encrypting) the public key with 
its own private key, and sends it back to server A. When server A connects to 
server B, server B must 1) authenticate the identity of the client, as well as 2) 
identify that the call is being made through server A. This is achieved by server 
A connecting with the session certificate received from the client along with the 
original certificate of the client. This is shown in Figure 6. 
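
The chain check in this scenario can be modeled structurally as follows. This is a toy sketch under strong assumptions: it performs no cryptography at all, keys are placeholder strings, and "signed by" is reduced to comparing those strings. The names `Certificate`, `chain_is_valid`, and `authenticated_user` are hypothetical and are not the NAA API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Certificate:
    subject: str      # who the certificate names
    public_key: str   # placeholder string standing in for a real public key
    signer_key: str   # public key matching the private key that signed this certificate


def chain_is_valid(chain: Sequence[Certificate], ca_key: str) -> bool:
    """The first certificate must be signed by the CA, each later one by its predecessor."""
    expected = ca_key
    for cert in chain:
        if cert.signer_key != expected:
            return False
        expected = cert.public_key  # this key "unlocks" the next element of the chain
    return len(chain) > 0


def authenticated_user(chain: Sequence[Certificate], ca_key: str, user: str) -> bool:
    """The user is accepted if the chain is valid and the user's certificate appears in it."""
    return chain_is_valid(chain, ca_key) and any(c.subject == user for c in chain)


# The client's own certificate, signed by the CA.
client_cert = Certificate("client", public_key="client-pub", signer_key="ca-pub")
# Session certificate: server A's freshly generated public key, signed by the client.
session_cert = Certificate("session@serverA", public_key="serverA-session-pub",
                           signer_key="client-pub")
# A forged session certificate that was not signed with the client's key.
forged_cert = Certificate("session@serverA", public_key="mallory-session-pub",
                          signer_key="mallory-pub")

accepted = authenticated_user([client_cert, session_cert], "ca-pub", "client")
rejected = authenticated_user([client_cert, forged_cert], "ca-pub", "client")
```

In this model server B would see both that the chain leads back to the client and that its last element belongs to server A's session, which corresponds to requirements 1) and 2) above.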



NAA Policies. We have designed NAA policies to be extensible and 
customizable by the system administrators. 

After the client is authenticated, authorization in NAA is performed using a 
structure similar to the Java 1.2 Policy class. A policy is a set of structures called 
grants, which in turn are sets of permissions given to the user. The NAA library 
manages the policy structures and the identities of current clients of the system. In 
addition, the NAA namespace is tree-structured according to the X.509 conventions. 
Access control is done hierarchically along this tree using the permissions. 

The server program inquires whether a certain permission is applicable to 
the client. The library checks the policy for grants that contain the 
particular permission. Each permission consists of three attributes: class, target, 
and action. Class indicates the operation that the permission allows the client to 
perform. Target and action designate the subject of the operation, along with 
the type of the operation to be performed. 

Fig. 6. Delegation of Identity 

<!ELEMENT policy (grant)*> 
<!ELEMENT grant (permission)*> 
<!ATTLIST grant userid CDATA #REQUIRED> 
<!ELEMENT permission EMPTY> 
<!ATTLIST permission class CDATA #REQUIRED> 
<!ATTLIST permission target CDATA #REQUIRED> 
<!ATTLIST permission action CDATA #REQUIRED> 

Fig. 7. Policy DTD 

Policies are described using XML in a policy file. We illustrate the policy file 
DTD and an example of policy description in Figure 7 and Figure 8, respectively. 











<policy> 
  <grant userid="c=jp,o=etl"> 
    <permission class="stubexec" 
                target="test/entry0" action="100 20"/> 
    <permission class="stubexec" 
                target="test/entry1" action="100 20"/> 
  </grant> 
  <grant userid="c=jp,o=etl,CN=nakada"> 
    <permission class="stubexec" 
                target="test/entry3" action="100 20"/> 
  </grant> 
</policy> 

Fig. 8. Example Policy 



In the example, we have defined two grants. The first grant indicates that 
the user whose identity includes c=jp,o=etl (meaning the Electrotechnical Lab) 
can remotely execute test/entry0 and test/entry1. The second grant restricts 
test/entry3 to only be remotely executed by userID c=jp,o=etl,CN=nakada. 
Thus, the client c=jp,o=etl,CN=nakada can execute all the remote libraries 
(entry0, entry1, and entry3), while the client c=jp,o=etl,CN=sekiguchi 
can execute only entry0 and entry1; furthermore, the client 
c=jp,o=titech,CN=matsuoka cannot execute any of the libraries. 
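
The outcome described for the three clients can be reproduced with a small evaluator. This is a sketch under assumptions: the functions `name_parts`, `grant_applies`, and `allowed_targets` are hypothetical, the hierarchical match is modeled as "the grant's userid is a leading portion of the client's identity", and the `action` attribute is ignored because its semantics are not spelled out here.

```python
import xml.etree.ElementTree as ET

# The example policy, embedded so that the sketch is self-contained.
POLICY_XML = """
<policy>
  <grant userid="c=jp,o=etl">
    <permission class="stubexec" target="test/entry0" action="100 20"/>
    <permission class="stubexec" target="test/entry1" action="100 20"/>
  </grant>
  <grant userid="c=jp,o=etl,CN=nakada">
    <permission class="stubexec" target="test/entry3" action="100 20"/>
  </grant>
</policy>
"""


def name_parts(identity: str) -> list[str]:
    """Split 'c=jp,o=etl,CN=nakada' into its components, ignoring stray spaces."""
    return [part.strip() for part in identity.split(",")]


def grant_applies(grant_id: str, client_id: str) -> bool:
    """A grant covers every client whose identity lies below it in the name tree."""
    grant, client = name_parts(grant_id), name_parts(client_id)
    return client[:len(grant)] == grant


def allowed_targets(policy_xml: str, client_id: str,
                    perm_class: str = "stubexec") -> set[str]:
    """Collect the targets that the policy permits for this client and class."""
    targets: set[str] = set()
    for grant in ET.fromstring(policy_xml).findall("grant"):
        if grant_applies(grant.get("userid"), client_id):
            for perm in grant.findall("permission"):
                if perm.get("class") == perm_class:
                    targets.add(perm.get("target"))
    return targets


nakada = allowed_targets(POLICY_XML, "c=jp,o=etl,CN=nakada")
sekiguchi = allowed_targets(POLICY_XML, "c=jp,o=etl,CN=sekiguchi")
matsuoka = allowed_targets(POLICY_XML, "c=jp,o=titech,CN=matsuoka")
```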

We can also grant rights to specific calls made by the client through delegation 
of identity; for instance, in the delegation of identity scenario described earlier, 
we can specify that a certain executable may be invoked only if a particular client 
was executing a library on server A which in turn had called the executable on 
server B. Such a case is conceivable when a large compute server B is used as a 
backend for a server A that is more open to public usage: only 
a restricted set of jobs can be run on server B, and users are not allowed to 
invoke a remote library on server B directly; rather, they must do so via server 
A. 

In this manner, the hierarchical namespace, along with the policy structure, 
gives fine-grained access control of resources for remote libraries in a NES system. 
Preliminary measurements have shown that such mechanisms do not impose 
significant overhead, as long as the call granularity is large enough for the 
overhead to be amortized (beyond tens of seconds). 

4 Conclusion 

We have covered the technical tradeoff points of NES systems, and described 
how the new Ninf system v.2.0 has been designed with the tradeoffs in mind, 
with descriptions of why particular choices in the tradeoffs were made. 
We hope that most of the design space has been covered, and that this will serve as a 
guide for designing future NES systems. 

We are currently deploying Ninf v.2.0 alongside v.1.0 to 
compare and verify the effectiveness of the design decisions, along with performance 
analysis to assess their impact. 

Abstract: The concept of coupling geographically distributed (high-end) resources for 
solving large-scale problems is becoming increasingly popular, forming what is 
popularly called grid computing. The management of resources in the grid 
environment becomes complex as they are (geographically) distributed, heterogeneous 
in nature, owned by different individuals/organizations each having their own resource 
management policies and different access-and-cost models. In this scenario, a number 
of alternatives exist while creating a framework for grid resource management. In this 
paper, we discuss the three alternative models — hierarchical, abstract owner, and 
market — for grid resource management architectures. The hierarchical model exhibits 
the approach followed in (many) contemporary grid systems. The abstract owner 
model follows an order and delivery approach in job submission and result gathering. 
The (computational) market model captures the essentials of both hierarchical and 
abstract owner models and proposes the use of computational economy in the 
development of grid resource management systems. 



1. Introduction 

The growing popularity of the Internet and the availability of powerful computers and 
high-speed networks as low-cost commodity components are changing the way we do 
computing and use computers today. The interest in coupling geographically 
distributed (computational) resources is also growing for solving large-scale 
problems, leading to what is popularly known as grid computing. In this environment, 
a wide variety of computational resources (such as supercomputers, clusters, and 
SMPs including low-end systems such as PCs/workstations), visualisation devices, 
storage systems and databases, special class of scientific instruments (such as radio 
telescopes), computational kernels, and so on are logically coupled together and 
presented as a single integrated resource to the user (see Figure 1). The user 
essentially interacts with a resource broker that hides the complexities of grid 
computing. The broker discovers resources that the user can access through grid 
information server(s), negotiates with (grid-enabled) resources or their agents using 
middleware services, maps tasks to resources (scheduling), stages the application and 
data for processing (deployment) and finally gathers results. It is also responsible for 
monitoring application execution progress along with managing changes in the grid 
infrastructure and resource failures. There are a number of projects worldwide [5] 
that are actively exploring the development of various grid computing system 
components, services, and applications. They include Globus [7], Legion [9], 
NetSolve [10], Ninf [15], AppLeS [11], Nimrod/G [3], and JaWS [16]. All of these 
grid systems are discussed in [2]. 




[Figure 1 labels: Grid Information Server(s); Grid Resource Broker (resource discovery and scheduling); Client-1 ... Client-N; Applications.] 

Figure 1: A Generic View of GRID System. 

The current research and investment into computational grids is motivated by an 
assumption that coordinated access to diverse and geographically distributed 
resources is valuable. In this paradigm, it is not only important to determine 
mechanisms and policies that allow such coordinated access, but it also seems 
reasonable that owners of those resources, or of mechanisms to connect and utilize 
them, should be able to recoup some of the resulting value from users or clients. 
Approaches to recouping such value in the existing Internet/web infrastructure, where 
e-commerce sites use advertising and/or mark-ups on products sold to show revenue, 
do not translate well (or are unsuitable) to a computational grid framework, primarily 
due to the fact that the immediate user of any specific resource in a computational 
grid is often not a human. Instead, in a grid, many different resources, potentially 
controlled by diverse organizations with diverse policies in widely-distributed 
locations, must all be used together, and the relationship between the value provided 
by each resource and the value of the product or service delivered to the eventual 
human consumer may be very complex. In addition, it is unrealistic to assume that 
human-created contracts can be developed between all potential resource users and 
resource owners in these situations, since the potential of computational grids can 
only be fully exploited if similar resources owned by different owners can be used 
almost interchangeably. 

Still, the existing real world must be acknowledged. Grid resources are largely 
owned and used by individuals or institutions who often provide "free" access for 
solving problems of common interest/public good (e.g., SETI@Home [13]), for 
prize/fame (e.g., the distributed.net [14] response to the challenge to break RSA security 
algorithms), or as collaborative resources (GUSTO [6]), or by companies that are loath to 
allow others to use them, primarily due to concerns about competition and security. 
The existing control over resources is subject to different policies and restrictions, as 
well as different software infrastructure used to schedule them. Any new approach to 
manage or share these resources will not be viable unless it allows a gradual layering 
of functionality or at least a gradual transition schedule from existing approaches to 
more novel ones. Even in the existing cases where money does not actually change 
hands, it is often important to provide a proper accounting of cross-organizational 
resource usage. In order to address these concerns, we propose different approaches 
for modeling grid resource management systems. 



2. Architecture Models 

As the grid logically couples multiple resources owned by different individuals or 
organisations, the choice of the right model for resource management architecture 
plays a major role in its eventual (commercial) success. There are a number of 
approaches that one can follow in developing grid resource management systems. In 
the next three sections, we discuss the following three different models for grid 
resource management architecture: 

• Hierarchical Model 

• Abstract Owner Model 

• Computational Market/Economy Model 

In the first, we characterize existing resource management and scheduling 
mechanisms by suggesting a more general view of those mechanisms. Next, we 
suggest a rather idealistic and extensive proposal for resource sharing and economy, 
which for the most part, ignores existing infrastructure in order to focus on long-term 
goals. Finally, we describe a more incremental architecture that is already underway 
to integrate some aspects of a computational economy into the existing grid 
infrastructure. Table 1 shows a few representative systems whose architecture 
complies with one of these models. 

Table 1: Three Models for a Grid Resource Management Architecture. 

MODEL           | REMARKS                                    | SYSTEMS 
----------------+--------------------------------------------+------------------ 
Hierarchical    | It captures the architecture model         | Globus, Legion, 
                | followed in most contemporary systems.     | Ninf, NetSolve. 
----------------+--------------------------------------------+------------------ 
Abstract Owner  | It follows an order and delivery model     | Expected to 
                | for resource sharing, which for the most   | emerge. 
                | part ignores existing infrastructure in    | 
                | order to focus on long-term goals.         | 
----------------+--------------------------------------------+------------------ 
Market/Economy  | It follows an economic model in resource   | Nimrod/G, JaWS, 
                | discovery and scheduling that can coexist  | Mariposa, 
                | or work with contemporary systems and      | JavaMarket. 
                | captures the essence of both hierarchical  | 
                | and abstract owner models.                 | 

The grid architecture models need to encourage resource owners to contribute their 
resources, offer a fair basis for sharing resources among users, and regulate resource 
demand and supply. They influence the way scheduling systems are built, as these are 
responsible for mapping user requests to the right set of resources. Grid 
scheduling systems need to follow a multilevel scheduling architecture, as each resource 
has its own scheduling system and users schedule their applications on the grid using 
super-schedulers called resource brokers (see Figure 1). 



3. Hierarchical Resource Management 

The hierarchical model for grid resource management architecture (shown in Figure 
2) is an outcome of the second Grid Forum [20] meeting and was proposed in [21]. The major 
components of this architecture are divided into passive and active components. The 
passive components are: 

• Resources are things that can be used for a period of time, and may or may 
not be renewable. They have owners, who may charge others for using 
resources and they can be shared, or exclusive. Resources might be explicitly 
named, or be described parametrically. Examples of resources include disk 
space, network bandwidth, specialized device time, and CPU time. 

• Tasks are consumers of resources, and include both traditional computational 
tasks and non-computational tasks such as file staging and communication. 

• Jobs are hierarchical entities, and may have recursive structure; i.e., jobs can 
be composed of subjobs or tasks, and subjobs may themselves contain 
subjobs. The leaves of this structure are tasks. The simplest form of a job is 
one containing a single task. 

• Schedules are mappings of tasks to resources over time. Note that we map 
tasks to resources, not jobs, because jobs are containers for tasks, and tasks 
are the actual resource consumers. 
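
The passive components lend themselves to a small data-structure sketch. It is illustrative only: the class names `Task` and `Job`, the `Schedule` alias, and the resource name are assumptions made for the example and do not come from the Grid Forum proposal.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class Task:
    """A leaf of the job structure: the actual consumer of resources."""
    name: str


@dataclass
class Job:
    """A container that may hold tasks and, recursively, subjobs."""
    name: str
    children: list[Union["Job", Task]] = field(default_factory=list)

    def tasks(self) -> list[Task]:
        """Return the leaf tasks in depth-first order."""
        leaves: list[Task] = []
        for child in self.children:
            if isinstance(child, Job):
                leaves.extend(child.tasks())
            else:
                leaves.append(child)
        return leaves


# A schedule maps tasks (not jobs) to a resource name and a start time.
Schedule = dict[Task, tuple[str, float]]

stage = Task("stage-input")
job = Job("analysis", [
    stage,
    Job("compute", [Task("solve-1"), Task("solve-2")]),
    Task("collect"),
])
schedule: Schedule = {
    task: ("cluster-A", 10.0 * i) for i, task in enumerate(job.tasks())
}
```

Jobs never appear as keys of the schedule; they are flattened to their tasks first, which is the distinction the definition of schedules draws.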

The active components are: 

• Schedulers compute one or more schedules for input lists of jobs, subject to 
constraints that can be specified at runtime. The unit of scheduling is the 
job, meaning that schedulers attempt to map all the tasks in a job at once, and 
jobs, not tasks, are submitted to schedulers. 

• Information Services act as databases for describing items of interest to the 
resource management systems, such as resources, jobs, schedulers, agents, 
etc. We do not require any particular access method or implementation; it 
could be LDAP, a commercial database, or something else entirely. 

• Domain Control Agents can commit resources for use; as the name implies, 
the set of resources controlled by an agent is a control domain. This is what 
some people mean when they say local resource manager. We expect 
domain control agents to support reservations. Domain Control Agents are 
distinct from Schedulers, but control domains may contain internal 
Schedulers. A Domain Control Agent can provide state information, either 
through publishing in an Information Service or via direct querying. 
Examples of domain control agents include the Maui Scheduler, Globus 
GRAM, and Legion Host Object. 

• Deployment Agents implement schedules by negotiating with domain control 
agents to obtain resources and start tasks running. 

• Users submit jobs to the Resource Management System for execution. 

• Admission Control Agents determine whether the system can accommodate 
additional jobs, and reject or postpone jobs when the system is saturated. 

• Monitors track the progress of jobs. Monitors obtain job status from the 
tasks comprising the job and from the Domain Control Agents where those 
tasks are running. Based on this status, the Monitor may perform outcalls to 
Job Control Agents and Schedulers to effect remapping of the job. 

• Job Control Agents are responsible for shepherding a job through the system, 
and can act both as a proxy for the user and as a persistent control point for a 
job. It is the responsibility of the job control agent to coordinate between 
different components within the resource management system, e.g. to 
coordinate between monitors and schedulers. 

We have striven to be as general as is feasible in our definitions. Many of these 
distinctions are logical distinctions. For example, we have divided the responsibilities 
of schedulers, deployment agents, and monitors, although it is entirely reasonable and 
expected that some scheduling systems may combine two or all three of these in a 
single program. Schedulers outside control domains cannot commit resources; these 
are known as metaschedulers or super schedulers. In our early discussions, we 
intentionally referred to control domains as “the box” because it connotes an 
important separation of “inside the box” vs. “outside the box.” Actions outside the 
box are requests; actions inside the box may be commands. It may well be that the 
system is fractal in nature, and that entire grid scheduling systems may exist inside the 
box. Therefore, we can treat the control domain as a black box from the outside. 

We have intentionally not defined any relationship between the number of users, 
jobs, and the major entities in the system (admission agents, schedulers, deployment 
agents, and monitors). Possibilities range from per-user or per-job agents to a single 
monolithic agent per system; each approach has strengths and weaknesses, and 
nothing in our definitions precludes or favors a particular use of the system. We 
expect to see local system defaults (e.g. a default scheduler or deployment agent) with 
users substituting their personal agents when they desire to do so. 

One can notice that the word queue has not been mentioned in this model; queuing 
systems imply homogeneity of resources and a degree of control that simply will not 
be present in true grid systems. Queuing systems will most certainly exist within 
control domains. 

Interaction of Components 

The interactions between components of the resource management system are shown 
in Figure 2. An arrow in the figure means that communication is taking place 
between components. We will next describe, at a high level, what we envision these 
interactions to be. This is the beginning of a protocol definition. Once the high-level 
operations are agreed upon, we can concern ourselves with wire-level protocols. 

Figure 2: Hierarchical Model for Grid Resource Management. 

We will begin with an example. A user submits a job to a job control agent, which 
calls an admission agent. The admission agent examines the resource demands of the 
job (perhaps consulting with a grid information system) and determines that it is safe 
to add the job to the current pool of work for the system. The admission agent passes 
the job to a scheduler, which performs resource discovery using the grid information 
system and then consults with domain control agents to determine the current state 
and availability of resources. 

The scheduler then computes a set of mappings and passes these mappings to a 
deployment agent. The deployment agent negotiates with the domain control agents 
for the resources indicated in the schedule, and obtains reservations for the resources. 
These reservations are passed to the job control agent. At the proper time, the job 
control agent works with a different deployment agent, and the deployment agent 
coordinates with the appropriate domain control agents to start the tasks running. A 
monitor tracks progress of the job, and may later decide to reschedule if performance 
is lower than expected. 

This is but one way in which these components might coordinate. Some systems 
will omit certain functionality (e.g. the job control agent), while others will combine 
multiple roles in a single agent. For example, a single process might naturally 
perform the roles of job control agent and monitor. 

4. Abstract Owner (AO) Model 

Where is the grid, and who owns it? These puzzles are not unique to the grid. When 
one makes a long distance phone call, who "owns" the resource being used? Who 
owns the generators that create the electricity to run an appliance? Who owns the 
Internet? Users of these resources don’t care, and don’t want to care. What they do 
want is the ability to make an agreement with some entity regarding the conditions 
under which the resources can be used, the mechanisms for using the resources, the 
cost of the resources, and the means of payment. The entity with which the user deals 
(the phone company, power company, or ISP) is almost certainly not the owner of the 
resources, but the user can think of them that way abstractly. They are actually 
brokers, who may in turn deal with the owners, or perhaps with more brokers. At each 
stage, the broker is an abstraction for all of the owners and so it is with the grid. 

The grid user wants an abstraction of an entity that "owns" the grid, and to make an 
arrangement with that "owner" regarding the use of their resources, possibly involving 
a trade of something of value for the usage (which could be nothing more tangible 
than goodwill or the future use of their own resources). It is proposed here that each 
grid resource, ranging in complexity from individual processors and instruments to 
the grid itself, be represented by one or more “abstract owners” (abbreviated as AOs) 
that are strongly related to schedulers. For complex resources, an AO will certainly 
be a broker for the actual owners or other brokers, though the resource user doesn't 
need to be aware of this. (A resource user will hereafter be assumed to be a program, 
and referred to as a client. Human clients are assumed to use automated agents to 
represent them in negotiations with an AO.) The arrangement between the client 
and an AO for acquiring and using the resource can be made through a pre-existing 
contract (e.g. a flat rate, or a sliding scale depending on the time until the resource is 
available) or based on a dialogue between client and AO regarding the price and 
availability of the resource. 

The remainder of this AO proposal describes what an AO looks like (externally 
and internally), what a resource looks like, how a client negotiates with an AO to 
acquire a resource, how a client interacts with a resource, and how AOs can be 
assembled into other constructs which may more closely resemble traditional 
schedulers. This work is still in the high-level design stages, and is presented in the hope that it will draw 
out refinements, corrections, and extensions that might help it to become viable. 



General Structure of AO 

At its most abstract, an AO outwardly resembles a fast-food restaurant (see Figure 
3a). To acquire access to a resource from an AO that “owns” it, the prospective client 
(which may be another AO) negotiates with that AO through its Order Window. 
These negotiations may include asking how soon the resource may become available, 
how much it might cost, etc. If the prospective client is not happy with the results of 
the negotiations, it may simply terminate them; otherwise it may place an order. 
After being ordered, the resources are delivered from the AO to the client through the 
Pickup Window. The precise protocol to be used for acquiring the resources is 
flexible and may also be negotiated at order time; e.g. the client may be expected to 
pick up the resource at a given time, or the AO may alert the client (via an interrupt or 
signal) when the resource is ready. Even after an order is placed (but before the resource 
has been delivered), the client may cancel the order through the Order Window. 




Figure 3: Abstract Owner Model for Grid Resource Management Architecture. 
Panels include (b) AO is Resource Owner and (e) Job Shop. 

Little more is said here about the actual form of these “windows” except that they 
need to be accessible remotely, and must support a standard procedure-like interface 
in which values are passed to and returned from the window. Since interaction with an 
AO is likely to be rather infrequent and requires a relatively small amount of 
information flow, maximum efficiency is not necessarily required: CORBA or any 
number of other remote procedure invocation techniques can be used. 

For the purposes of this discussion, a resource is roughly defined as any 
combination of hardware and software that helps the client to solve a problem, and a 
task is that part of a problem that is specified by the client after the resource has been 
delivered ("picked up") from the AO. Note that, unlike some other definitions of 
"task", these tasks may be very simple (e.g. a data set to be analyzed or a message to 
be sent), more general (e.g. a process to be executed), or very complex (e.g. a 
complete multi-process program and/or set of programs or processes to be executed in 
some order). While AOs do not specifically deal with entities called "jobs", 
techniques for applying the AO approach to traditional job scheduling will be 
addressed in the last subsection. 



Resources can (and will) be regarded as objects, in the sense that they have an 
identity, a set of methods for initiating and controlling tasks, and attributes that serve 
to customize the resource. In general, the desired attributes will be determined during 
negotiation through the Order Window, when the client requests the resource, and 
will only be queried (not altered) after the resource is delivered. The methods may 
take many different forms, depending upon circumstances such as the type of 
resource, availability of hardware protections, and whether the method is to be 
invoked locally or remotely. For example, access to a local memory resource may 
have virtually no method protocol interfering with standard memory access 
operations, while initiating a process on a distant processor may require a more 
substantial method invocation protocol. A resource is relinquished by invoking its 
"relinquish" method (or by timing out). 

The external structure of an AO was formulated to allow any level of nesting. 
Internally, an AO will differ in structure depending on whether it is a broker or an 
owner (or a combination). A pure owner of a single physical resource might be very 
simple (see Figure 3b), where the "manager" includes the intelligence required to 
negotiate, keep the schedule, and deliver the resource. For a higher-level broker, it 
might be more complex (see Figure 3c). Here, AO1, AO2, and AO3 represent other 
Abstract Owners, each with an Order Window used by the Sales Representative, and a 
Pickup Window used by the Delivery Representative. Though these subordinate AOs 
are shown within a single parent AO, there is no reason that this relation must be 
hierarchical; a single AO may provide resources to a number of different parent AOs, 
which may assemble these into more complex resources in different ways or for 
different client sets, or may support different protocols, strategies, or policies. 



Grid Resources 

Three primary classes are proposed here to represent resources: Instruments, 
Channels, and Complexes. An Instrument is a resource which logically exists at some 
location for some specific period of time, and which creates, consumes, or transforms 
data or information. The term "location" may be as specific or general as the situation 
merits. A Channel is a resource that exists to facilitate the explicit transfer of data or 
information between two or more instruments, either at different locations, or in the 
same location at different times (acting as a sort of temporary file in that case), or 
instruments which share space-time coordinates but have different protection 
domains. A Channel connects to an Instrument through a Port (on the instrument). A 
Complex is nothing more than a collection of (connected) Channel and Instrument 
resources. 
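The three resource classes and the Port can be pictured as a small object model. The following is only a sketch under stated assumptions: the class shapes, field names, and the isClosed() check are hypothetical illustrations and are not part of the AO proposal, which defines the concepts but not an interface.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; the text defines only the concepts.

/** A connection point on an Instrument; a Channel attaches here. */
final class Port {
    final String name;
    final Instrument owner;

    Port(String name, Instrument owner) {
        this.name = name;
        this.owner = owner;
    }
}

/** A resource at some location that creates, consumes, or transforms data. */
class Instrument {
    final String id;
    final String location;
    final List<Port> ports = new ArrayList<>();

    Instrument(String id, String location) {
        this.id = id;
        this.location = location;
    }

    /** Adds a named port to this instrument and returns it. */
    Port addPort(String name) {
        Port p = new Port(name, this);
        ports.add(p);
        return p;
    }
}

/** A resource that carries data between ports of instruments. */
final class Channel {
    final List<Port> endpoints;

    Channel(List<Port> endpoints) {
        if (endpoints.size() < 2) {
            throw new IllegalArgumentException("a channel needs at least two endpoints");
        }
        this.endpoints = new ArrayList<>(endpoints);
    }
}

/** A collection of connected Channel and Instrument resources. */
final class Complex {
    final List<Instrument> instruments = new ArrayList<>();
    final List<Channel> channels = new ArrayList<>();

    /** True if every channel endpoint belongs to an instrument in this complex. */
    boolean isClosed() {
        for (Channel c : channels) {
            for (Port p : c.endpoints) {
                if (!instruments.contains(p.owner)) return false;
            }
        }
        return true;
    }
}
```

In this sketch a Complex is just a container; checking that its channels only reference its own instruments is one plausible validity rule an AO might apply before scheduling it.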

Some important sub-classes of the Instrument class are the Compute instrument, 
the Archival instrument, and the Personal instrument. The Compute instrument 
corresponds to a processor or set of processors along with associated memory, temp 
files, software, etc. Archival Instruments (of which a permanent file is one sub-class) 
correspond to persistent storage of information. Personal instruments are those that 
are assumed to interface directly to a human being, ranging from a simple terminal to 
a more complex CAVE or speech recognition/synthesis device; their specification 
may include the identity of the person involved. Of course, the Instrument class is 
also meant to accommodate other machines and instruments such as telescopes, 
electron microscopes, automatic milling machines, or any other sink or source for grid 
data. 

As stated, an instrument exists in a location, and its methods may need to be called 
either locally (from the instrument itself) or remotely. For example, if a (reference to 
a) Compute instrument is acquired from an AO, the potentially distant process may 
want to invoke a "load_software" method to initiate a program on the resource. This 
new program may then want to invoke methods to access the temporary files or ports 
associated with the resource. Since the latter accesses will be local and must be 
efficient, it is desirable to provide separate method invocation protocols for remote 
and local method invocation. Moreover, remote method invocations (RMIs) may 
themselves require the use of intermediate communication resources between the 
client and the resource, perhaps with associated quality of service (QoS) constraints. 

To facilitate remote method invocations, any port(s) of an instrument can be 
specially designated as an RMI port. Such ports will have the appropriate RMI 
protocol handlers assigned to them. This designation is an attribute of the port, i.e., it 
is specified at resource negotiation time through the Order Window, just as 
authorization and notification style are. Methods can be invoked through such a port 
either by connecting a channel to the port and issuing the RMI request through the 
channel, or in a connectionless mode by specifying the object and port. The former 
approach is best when issuing repeated RMI calls or when QoS is desired for RMI 
calls; the latter is best for one-time-only calls such as initializing an instrument which 
has just been acquired from an AO. 



Negotiating with an AO 

When negotiating through the order window, the client first effectively creates a 
"sample" resource object of the appropriate structure and assigns each attribute either 
(1) a constant value, (2) a "don’t care" value, or (3) a variable name (which will 
actually take the form of, and be interpreted as, an index into a variable value table). 
If the same variable name is used in multiple places, it has the effect of constraining 
those attributes to have the same value. An example of this is to use a single variable 
to specify the "beginning time" attribute on several Instrument objects to cause them 
to be co-scheduled. Another is to specify variables for Instruments’ object IDs, then to 
use those same variables when specifying the endpoints of the channels between 
them. The client may also specify simple constraints on the variables in a separate 
constraint list. 
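The effect of reusing one variable name in several attributes can be illustrated with a small matcher. This is a minimal sketch, assuming attributes are held in maps and variables are identified by strings rather than table indices; Slot, SampleMatcher, and the attribute names are hypothetical and do not come from the proposal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** One attribute slot in the client's "sample" resource object (hypothetical type). */
final class Slot {
    enum Kind { CONSTANT, DONT_CARE, VARIABLE }

    final Kind kind;
    final Object value;  // set only for CONSTANT slots
    final String name;   // set only for VARIABLE slots

    private Slot(Kind kind, Object value, String name) {
        this.kind = kind;
        this.value = value;
        this.name = name;
    }

    static Slot ofConstant(Object value) { return new Slot(Kind.CONSTANT, value, null); }
    static Slot dontCare() { return new Slot(Kind.DONT_CARE, null, null); }
    static Slot ofVariable(String name) { return new Slot(Kind.VARIABLE, null, name); }
}

final class SampleMatcher {
    /**
     * Matches a sample object against the attributes of an offered resource.
     * Variables already present in valueTable must agree with the offer;
     * unbound variables are bound. The table is updated only on success.
     */
    static boolean match(Map<String, Slot> sample,
                         Map<String, Object> offered,
                         Map<String, Object> valueTable) {
        Map<String, Object> trial = new HashMap<>(valueTable);
        for (Map.Entry<String, Slot> entry : sample.entrySet()) {
            Slot slot = entry.getValue();
            Object actual = offered.get(entry.getKey());
            if (slot.kind == Slot.Kind.CONSTANT) {
                if (!Objects.equals(slot.value, actual)) return false;
            } else if (slot.kind == Slot.Kind.VARIABLE) {
                if (trial.containsKey(slot.name)) {
                    if (!Objects.equals(trial.get(slot.name), actual)) return false;
                } else {
                    trial.put(slot.name, actual);
                }
            }
            // DONT_CARE slots accept any value.
        }
        valueTable.putAll(trial);
        return true;
    }
}
```

Matching two Instrument samples against the same value table with a shared variable for the beginning time forces both to start together, which is the co-scheduling effect described above.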

Usually, the values in the variable value table are filled in and returned by the AO 
when the resource is acquired, but the client can designate some subset of those 
variables as negotiation variables. For these, the AO will propose values during 
negotiation, which the client can then examine to decide whether or not to accept the 
resource. (If accepted, these values essentially become constants.) In general, it is 
quicker for the client to specify additional constraints instead of using negotiation 
variables, allowing the decision on suitability to be made wholly within the AO, but 
negotiation variables can help when more complex constraints are required or when a 
client must decide between similar resources offered by different AOs. 

In all, submissions to the Order Window from the client include the sample object 
attributes, the variable constraint list, a Negotiation Style, a Pickup Approach, an 
Authorization, a Bid, and a Negotiation ID. The Negotiation Style specifies whether 
the AO is to schedule the resource immediately (known as “Immediate”), or is to 
return a specified number of sets of proposed values for the negotiation variables 
(known as “Pending”), or is to finish scheduling based on an earlier-returned set of 
negotiation variable values (known as “Confirmation”), or is to cancel an earlier 
Pending negotiation (known as “Cancel”). The Pickup Approach specifies the 
protocol to be used between the AO and client at the Pickup Window — i.e. whether 
the AO will alert the client with a signal, interrupt, or message when the resource 
becomes available, or the client will poll the Pickup Window for the resource, or the 
client can expect to find the resource ready at the window at a specified time. The 
Authorization is a capability or key which allows the AO to determine the authority of 
the client to access resources (and to bill the client accordingly when the resources are 
delivered). The Bid is a maximum price that the client is willing to pay for the 
resource, and may denote a pre-understood algorithm (or “contract”) specifying how 
much the resource will cost under certain conditions. The Negotiation ID serves as a 
“cookie”, and is passed back and forth between the client and AO to provide an 
identity and continuity for a multi-interaction negotiation, and continuity between the 
negotiation of a resource and the ultimate delivery of the resource through the Pickup 
Window. (A zero Negotiation ID designates the beginning of a new negotiation.) 

If a Pending negotiation style is specified, the AO returns a value table containing 
sets of proposed values for the negotiation variables, and an “Ask” price for each set. 
The intent of the Ask price is to inform the client of a sufficient Bid price to be used 
when requesting the resource, but the AO may conceivably accept even lower Bid 
prices depending upon the specific situation. For all negotiations, the AO returns a 
return code informing the client of the success of the operation, a Negotiation ID 
(equal to that submitted, if it was nonzero), and an expiration date for the Negotiation 
ID. A single negotiation can continue until the Negotiation ID expires or a 
Negotiation Style other than “Pending” is specified. 

On a successful Immediate or Confirmation request, the client can then submit the 
Negotiation ID to the Pickup Window (at a time consistent with the Negotiation 
Style) to retrieve the resource. The Pickup Window returns the resource object, the 
variable value table, and a return code. Although the returned resource is logically an 
object, it is assumed that any attribute values that the client is concerned with are 
being returned in the Variable Value table, so the resource object just takes the form 
of a handle to access the resource object's methods. 
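The exchange between client and AO can be sketched as a toy owner with one resource and a fixed Ask price. This is an illustration under simplifying assumptions, not the proposed protocol: it omits the sample object, the constraint list, the Authorization, the Pickup Approach, and Negotiation ID expiry, and it always rejects Bids below the Ask (the text allows an AO to accept lower Bids). ToyAbstractOwner, OrderReply, and Style are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** The four Negotiation Styles named in the text. */
enum Style { IMMEDIATE, PENDING, CONFIRMATION, CANCEL }

/** Reply from the Order Window (hypothetical shape). */
final class OrderReply {
    final boolean ok;          // return code: did the operation succeed?
    final long negotiationId;  // the "cookie" for later interactions
    final double ask;          // Ask price proposed by the AO

    OrderReply(boolean ok, long negotiationId, double ask) {
        this.ok = ok;
        this.negotiationId = negotiationId;
        this.ask = ask;
    }
}

/** A toy AO that owns a single kind of resource with a fixed Ask price. */
final class ToyAbstractOwner {
    private final double askPrice;
    private long nextId = 1;
    // Negotiation ID -> true once the resource has been scheduled for delivery.
    private final Map<Long, Boolean> negotiations = new HashMap<>();

    ToyAbstractOwner(double askPrice) {
        this.askPrice = askPrice;
    }

    /** Order Window. A zero negotiationId begins a new negotiation. */
    OrderReply order(Style style, double bid, long negotiationId) {
        long id = negotiationId;
        if (id == 0) {
            if (style == Style.CONFIRMATION || style == Style.CANCEL) {
                return new OrderReply(false, 0, 0.0); // nothing to confirm or cancel
            }
            id = nextId++;
        } else if (!negotiations.containsKey(id)) {
            return new OrderReply(false, id, 0.0);    // unknown negotiation
        }
        switch (style) {
            case PENDING:
                negotiations.put(id, false);
                return new OrderReply(true, id, askPrice);
            case IMMEDIATE:
            case CONFIRMATION:
                if (bid < askPrice) return new OrderReply(false, id, askPrice);
                negotiations.put(id, true);
                return new OrderReply(true, id, askPrice);
            default: // CANCEL
                negotiations.remove(id);
                return new OrderReply(true, id, 0.0);
        }
    }

    /** Pickup Window: returns a resource handle, or null if nothing is ready. */
    String pickup(long negotiationId) {
        Boolean ready = negotiations.get(negotiationId);
        if (ready == null || !ready) return null;
        negotiations.remove(negotiationId);
        return "resource-handle-" + negotiationId;
    }
}
```

A client would typically issue a PENDING request to learn the Ask price, follow with a CONFIRMATION carrying a sufficient Bid and the same Negotiation ID, and then present that ID at the Pickup Window.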



Job Shops 

AOs apparently perform only part of the standard job scheduling process — i.e. 
acquiring a resource — leaving the remainder to the client — i.e. assigning tasks to the 
resources and monitoring their completion and/or cleanup, often in sequential and 
dependent steps. But this is only partially true. Recall that a Compute Instrument, 
exclusive of the task that is eventually assigned to it by the client, may consist of both 
hardware and software components. While the software components often serve to 
create an environment in which the eventual task will execute (such as libraries or 
interpreters), they may also be compilers and/or complete user programs. That is, the 
Compute Instrument itself can be defined as a processor executing a specific program. 
The task assigned to such an instrument may be a data-set or source code to be read 
by that program (or compiler), or even nothing at all if the resource is completely 
self-contained. Since the AO is responsible for preparing the instrument for delivery 
through the Pickup Window and recovering it after it has been relinquished, it is 
indeed responsible for initiating this software and cleaning up after it. 

The traditional sequential nature of job steps has resulted from the prevalence of 
uniprocessors and traditional sequential thinking, but it is already common for parallel 
“make” utilities, for example, to exploit potential parallelism in job-like scripts. 
Similarly, in an AO resource, compute instruments running the individual “job steps” 
can be connected to communicate through channels, allowing them to be scheduled 
locally or in a distributed fashion, and scheduled sequentially or in parallel by the AO, 
subject to the dependences dictated by the channels and the QoS constraints assigned 
to those channels by the client. In this way, a job can be represented as a Complex 
Instrument in the AO infrastructure, where it will be scheduled. 

Even with these capabilities, there is always the possibility that a more traditional 
job scheduler is required. In such a case, consider a new construct called a job shop, 
which uses AOs only to acquire resources, as shown in Figure 3d. See Figure 3e for 
an example of the internals of a standard job shop. The job shop primarily comprises 
an “estimator” and an “executor”, much like an auto repair shop. The estimator deals 
with the customer to help determine how soon the job might be done and how much it 
might cost, requests the resources needed from the grid AO (through its Order 
Window), and records what needs to be done (in a job queue) when the resources are 
ready. The executor takes ready resources from the AO Pickup Window, dequeues 
the associated work from the job queue, builds any necessary environment for those 
tasks (e.g. telling message passing routines which channels to use), initiates tasks, 
collects answers, and notifies the client and returns the answers to it. 

Nesting job shops (or traditional job schedulers in general) is not as natural as 
nesting AOs, primarily because a job shop provides little feedback to the client until it 
has acquired resources and assigned tasks to them. This means that tasks are often 
assigned to some resources even before others have been allocated, and may be 
shipped around to where the resources are, long before they are needed there. 

AO Summary 

There are many remaining gaps in the above description, both in detail and in 
functionality. For example, little has been said about how any client, whether an 
end-user or another AO, will find AOs that own the desired kind of resources. Certainly, 
one approach is to imagine a tree of AOs (as in Figure 3c), with the client always 
interacting with the root AO, but it is unrealistic to consider this tree as being 
hardwired when residing in an environment as dynamic as a computational grid. 
More likely, existing Internet protocols can be adapted for this purpose, and an AO 
might have a third “business dealings” window to facilitate them. Before an approach 
like AO has any likelihood of acceptance in a large community, it must address many 
such challenges. Even a potentially useful and well-defined (successfully prototyped) 
AO protocol will not be viable unless it can coexist with other contemporary 
approaches. It is therefore important to understand how AOs and constructs in these 
other systems can build upon one another and mimic one another. 







R. Buyya, S. Chapin, and D. DiNucci 



5. Market Model 



The resources in the grid environment are geographically distributed, and each of them 
is owned by a different organisation. Each organisation has its own resource 
management mechanisms and policies and may charge different prices to different 
users, which calls for support for computational economy in resource 
management. In [34], we have presented a number of arguments for the need for an 
economy (market) driven resource management system for the grid. It offers resource 
owners a better “incentive” for contributing their resources and helps them recover the 
costs they incur while serving grid users, finance the services that they offer to users, 
and also make some profit. This return-on-investment mechanism also helps in 
enhancing/expanding computational services and upgrading resources. It is important 
to note that an economy^ is one of the best institutions for regulating demand and 
supply. Naturally, in a computational market environment, resource users want to 
minimise their expenses (the price they pay) and owners want to maximise their 
return-on-investment. This necessitates a grid resource management system that 
provides appropriate tools and services to allow both resource users and owners to 
express their requirements. For instance, users should be allowed to specify their 
“QoS requirements”, such as minimising the computational cost (the amount that they 
are willing to pay) while still meeting the deadline by which they need results. Resource 
owners should be allowed to specify their charges (which can vary from time to time and 
from user to user) and terms of use. Systems such as Mariposa [17], Nimrod/G [3], and 
JaWS [16] base their user service model on the economy of computations, and it 
is likely that more and more systems based on this concept will emerge. 
Figure 4: Market Model for Grid Resource Management Architecture. 



^ We use the terms “economy” and “market” interchangeably. 
The market model for grid resource management captures the essentials of both 
the hierarchical and AO models presented above. Many of the contemporary grid 
systems fit the hierarchical model, while the AO model appears futuristic but implicitly 
points out the need for economy in computation. The issues discussed in the 
hierarchical model apply to the market model, but the market model emphasizes the use 
of economy-based resource management and scheduling. One of the possible 
architectures for grid resource management based on the computational market model is 
shown in Figure 4. The resource trading model can vary depending on the 
method/protocol used (by the trade manager) in determining the resource access cost. 

The following are the key components of an economy-driven resource management 
system: 

• User Applications (sequential, parametric, parallel, or collaborative 
applications) 

• Grid Resource Broker (a.k.a., Super/Global/Meta Scheduler) 

• Grid Middleware 

• Domain Resource Manager (Local Scheduler or Queuing system) 



Grid Resource Broker (GRB) 

The resource broker acts as a mediator between the user and grid resources using 
middleware services. It is responsible for resource discovery, resource selection, 
binding of software (application), data, and hardware resources, initiating 
computations, adapting to the changes in grid resources, and presenting the grid to the 
user as a single, unified resource. The components of the resource broker are the 
following: 

• Job Control Agent (JCA): This is a persistent central 
component responsible for shepherding a job through the system. It takes 
care of schedule generation, the actual creation of jobs, maintenance of job 
status, and interaction with clients/users, the schedule advisor, and the dispatcher. 

• Schedule Advisor (Scheduler): This component is responsible for resource 
discovery (using the grid explorer), resource selection, and job assignment 
(schedule generation). Its key function is to select those resources that meet 
user requirements, such as meeting the deadline and minimizing the cost of 
computation, while assigning jobs to resources. 

• Grid Explorer: This component is responsible for resource discovery: it 
interacts with the grid information server, identifies the list of authorized 
machines, and keeps track of resource status information. 

• Trade Manager (TM): This component works under the direction of the resource 
selection algorithm (schedule advisor) to identify resource access costs. It 
interacts with trade servers (using middleware services/protocols such as those 
presented in [4]) and negotiates for access to resources at low cost. It can also 
find out the access cost through the grid information server if owners post it there. 

• Deployment Agent: This component is responsible for activating task execution 
on the selected resource as per the scheduler’s instruction. It periodically reports 
the status of task execution to the JCA. 
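The schedule advisor's key function, choosing resources that meet the deadline at minimum cost, can be sketched as a simple filter-and-minimise step. This is only an illustration under assumptions: Candidate, its fields, and the single-job selection rule are hypothetical, and a real broker would obtain costs from the trade manager and time estimates from resource status information.

```java
import java.util.List;

/** A candidate resource as seen by the schedule advisor (hypothetical fields). */
final class Candidate {
    final String name;
    final double costPerHour;     // access cost, e.g. as identified by the trade manager
    final double estimatedHours;  // estimated time to finish the job on this resource

    Candidate(String name, double costPerHour, double estimatedHours) {
        this.name = name;
        this.costPerHour = costPerHour;
        this.estimatedHours = estimatedHours;
    }

    double totalCost() {
        return costPerHour * estimatedHours;
    }
}

final class ScheduleAdvisor {
    /**
     * Returns the cheapest candidate that finishes within the deadline and
     * within the user's budget, or null if no candidate qualifies.
     */
    static Candidate select(List<Candidate> candidates,
                            double deadlineHours,
                            double budget) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.estimatedHours > deadlineHours) continue; // misses the deadline
            if (c.totalCost() > budget) continue;           // too expensive
            if (best == null || c.totalCost() < best.totalCost()) {
                best = c;
            }
        }
        return best;
    }
}
```

Tightening the deadline in this sketch moves the choice toward faster but more expensive resources, which is the deadline/cost trade-off the user expresses through QoS requirements.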
Grid Middleware 

The grid middleware offers services that help in coupling a grid user and (remote) 
resources through a resource broker or grid enabled application. It offers core services 
[12] such as remote process management, co-allocation of resources, storage access, 
information (directory), security, authentication, and Quality of Service (QoS) such as 
resource reservation for guaranteed availability and trading for minimising 
computational cost. Some of these services have already been discussed in the 
hierarchical model; here we point out the components that are specifically responsible 
for offering computational economy services: 

• Trade Server (TS): It is a resource owner agent that negotiates with 
resource users and sells access to resources. It aims to maximize the resource 
utility and profit for its owner (earn as much money as possible). It consults 
pricing algorithms/models defined by the users during negotiation and 
directs the accounting system to record resource usage. 

• Pricing Algorithms/Methods: These define the prices that resource owners 
would like to charge users. The resource owners may follow various policies 
to maximise profit and resource utilisation, and the price they charge may 
vary from time to time and from one user to another, and may also be driven 
by demand and supply as in a real market environment. 

• Accounting System: It is responsible for recording resource usage and billing 
the user as per the usage agreement between the resource broker (TM, user 
agent) and the trade server (resource owner agent) [19]. 



Domain Resource Manager 

Local resource managers are responsible for managing and scheduling computations 
across local resources such as workstations and clusters. They are also responsible 
for offering access to storage devices, databases, and special scientific instruments 
such as a radio telescope. Examples of local resource managers include cluster 
operating systems such as MOSIX [18] and queuing systems such as Condor [12]. 



Comments 

The services offered by the trade server could also be accessed from or offered by grid 
information servers (like yellow pages/advertised services or posted prices). In this 
case a trade manager or broker can directly access information services to identify 
the resource access cost and then contact resource agents for confirmation of access. 
The trade manager can use these advertised/posted prices (through the information 
server) or invite competitive quotes (tenders) or bids (from trade servers/resource 
owner agents) and choose resources that meet user requirements. 

From the above discussion it is clear that there exist numerous methods for 
determining the access cost. The resource trading shown in Figure 4 is therefore only 
one of the possible alternatives for the computational market model; as in a real-world 
economy, it can vary depending on the trading protocol in particular. Some of the 
real-world trading methods that can also be applied to computational economies include: 

• Advertised/posted prices (classified advertisements) through information server 

• Commodity exchanges 

• Negotiated prices 

• Call for (closed) tenders 

• Call for (open) bids 

Each of these methods can be applied in different situations in computational 
economies, and they create a competitive computational market depending on the 
demand and supply and the quality of service. The mechanism for informing resource 
owners about the availability of service opportunities can vary depending on its 
implementation. One of the simplest mechanisms is for users (buyers) and/or resource 
owners (sellers or their agents renting/leasing computational services) to 
post/publicise their requirements in a known location (for instance, an “exchange 
centre, share market, or grid information service directory”). Any one or all of them 
can initiate computational service trading. Through these mechanisms one can perform 
the following types of actions, as in real-world market economies: 

• Users can post their intentions/offers to buy access to resources/services 
(e.g., “20 cluster nodes for 2 hours for $50”); 

• Resource owners/grid nodes/providers/agents can post offers to sell (e.g., 
systems like NetSolve can announce “we solve 1000 simultaneous linear 
equations for $5”); 

• Users/resource owners can query about current opportunities including 
prices/bids and historical information. 
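The posted-price style of trading can be pictured as a shared board on which providers post offers and users query for the best match. The sketch below is an assumption-laden illustration: OfferBoard and SellOffer are hypothetical names, and it covers only sell offers and a cheapest-price query, not buy offers, tenders, or historical information.

```java
import java.util.ArrayList;
import java.util.List;

/** A posted offer to sell access to a service (hypothetical fields). */
final class SellOffer {
    final String provider;
    final String service;
    final double price;

    SellOffer(String provider, String service, double price) {
        this.provider = provider;
        this.service = service;
        this.price = price;
    }
}

/** A known location where offers are posted and queried. */
final class OfferBoard {
    private final List<SellOffer> offers = new ArrayList<>();

    /** A resource owner (or its agent) posts an offer to sell. */
    void post(SellOffer offer) {
        offers.add(offer);
    }

    /** Returns the cheapest offer for the service at or below maxPrice, or null. */
    SellOffer cheapest(String service, double maxPrice) {
        SellOffer best = null;
        for (SellOffer o : offers) {
            if (!o.service.equals(service) || o.price > maxPrice) continue;
            if (best == null || o.price < best.price) best = o;
        }
        return best;
    }
}
```

A trade manager could consult such a board first and fall back to direct negotiation with trade servers only when no posted price fits the user's budget.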

The different grid systems may follow different approaches in making this happen 
and it will be beneficial if they are all interoperable. The interoperability standards 
can be evolved through grid user/developer community forums or standardization 
organisations such as GF [20] and eGRID [22]. 



6. Discussion and Conclusions 



In this paper we have discussed three different models for grid resource management 
architecture inspired by three different philosophies. The hierarchical model captures 
the approach followed in many contemporary grid systems. The abstract owner model 
shows the potential of an order and delivery approach in job submission and result 
gathering. The (computational) market model captures the essentials of both the 
hierarchical and abstract owner models and uses the concept of computational economy. We have 
attempted to present these models in abstract high-level form as much as possible and 
have skipped low-level details for developers to decide (as they mostly change from 
one system to another). Many of the existing, upcoming and future grid systems can 
easily be mapped to one or more of the models discussed here (see Table 1). It is also 
obvious that real grid systems (as they evolve) are most likely to combine many of 
these ideas into a hybridized model (that captures essentials of all models) in their 
architecture. For instance, our Grid Economy [4] is developed as a combination of 
Globus and GRACE services based on a (hybridized) market model. 

The importance of market models for grid computing is also noted in 
Scientific American [23]: “So far not even the most ambitious metacomputing 
prototypes have tackled accounting: determining a fair price for idle processor cycles. 
It all depends on the risk, on the speed of the machine, on the cost of communication, 
on the importance of the problem— on a million variables, none of them well 
understood. If only for that reason, metacomputing will probably arrive with a 
whimper, not a bang”. We hope that our proposed computational market model for 
grid systems architecture, along with others, will help computational 
grids arrive with a big bang (not a whimper)! 
2.1 Overview of System Components 
Fig. 1. 
Basic System Protocols 
3 Resource Allocation 



[Figure content not recoverable; surviving labels: "performance vectors", "abort ratio".] 




Table 1 
4.1 The Generic Master-Slave Model 

public interface MSControl extends gr.jaws.SchedulerControl { 
    // Starts the computation with the given partition and order parameters. 
    public void startComputation( Object partitionParams, 
                                  Object orderParams ) 
        throws java.rmi.RemoteException, 
               AlreadyStartedException, ExitingException; 
    public void stopComputation() 
        throws java.rmi.RemoteException, 
               NotStartedException, AlreadyStoppingException; 
    public Object[] getResults() throws java.rmi.RemoteException; 
    public void stopScheduler() throws java.rmi.RemoteException; 
} 

public abstract class MSScheduler 
    extends gr.jaws.models.msg.MsgScheduler implements MSControl { 
    // Creates the partitions from the partition parameters. 
    public abstract Object[] createPartitions( Object partitionParams ); 
    // Places an order using the order parameters. 
    public abstract void placeOrder( Object orderParams ); 
} 

public abstract class MSTask 
    extends gr.jaws.models.msg.MsgTask { 
    // Processes one partition and returns its result. 
    protected abstract Object processPartition( Object partition ); 
} 
4.2 A Sample Client Application 













// Look up the upload interface by its registered name. 
upIf = (gr.jaws.UploadInterface) 
    java.rmi.Naming.lookup( "//host.domain.gr:1200/" + 
                            gr.jaws.UploadInterface.upIfName ); 
// Upload the application archive and name its scheduler class. 
upIf.uploadCode( user, passwd, "fractalApp", "fractalApp.jar", 
                 jarContents, "FractalScheduler", false ); 

// Instantiate the scheduler and obtain its control interface. 
controlInterface = (FractalControl) 
    upIf.instantiateScheduler( user, passwd, "fractalApp" ); 
public class FractalScheduler 
    extends gr.jaws.models.ms.MSScheduler { 

    public Object[] createPartitions( Object pars ) { 
        FractalPartitionsDef fp = (FractalPartitionsDef) pars; 
        // Size the array from the cast parameter object, not the raw Object. 
        Object[] partitions = 
            new FractalPartition[ fp.totalPartitions ]; 
        for( int i=0; i<partitions.length; i++ ) 
            partitions[i] = /* Code for calculating partition i */ 
        return partitions; 
    } 

    public void placeOrder( Object orderPars ) { 
        FractalOrderParams p = (FractalOrderParams) orderPars; 
        placeBuyOrder( p.units, p.price, p.duration, 
                       0, 0, null, null, 0.0 ); 
    } 
} 

public class FractalTask extends gr.jaws.models.ms.MSTask { 
    public Object processPartition( Object partition ) { 
        FractalPartition p = (FractalPartition) partition; 
        /* Code for calculating fractal partition p */ 
        return myPart; 
    } 
} 
5 Related Work 
Abstract. Resource management in the typical Grid environment based on 
multi-MPP systems or clusters is still one of the challenging problems today. 
We present MeSch, a solution for the problem of resource allocation and 
job scheduling in a distributed heterogeneous environment. MeSch has been 
implemented and tested successfully in the heterogeneous multi-MPP 
environment of GMD’s Institute for Scientific Computing and Algorithms. 
MeSch allows users to access simultaneously, through a single request, 
heterogeneous resources distributed across the linked systems. This is possible 
either through explicit demands for different resources or through implicit 
scheduling of resources resulting from interpretation of requests. The 
scheduling system is available for both batch and interactive usage of resources. 
MeSch is implemented on top of locally available scheduling facilities, thus 
respecting the different scheduling systems and policies of the computing 
centers in the Grid. 



1 Introduction 

Resource management and job scheduling in the typical Grid environment based on
multi-MPP systems or clusters remains a challenging problem. Especially in a
geographically distributed and heterogeneous environment, it turns out that although
scheduling tools and policies are available for each subsystem, global resource
management is lacking, and thus resource allocation is far from being performed
automatically. On the contrary, a substantial amount of human communication on all
levels is necessary to partition the application, locate resources, and observe the
behavior of distributed modules.

We present MeSch, a light-weight solution to the problem of resource allocation
and job scheduling in a distributed heterogeneous environment. In the same way that a
Grid application uses the Grid resources as a metacomputing environment, making use
of more than one MPP system or cluster, MeSch builds on the idea of a metascheduler
that takes over the burden of resource allocation for a metajob. The approach is to
build the metascheduler such that it can use the schedulers of all subsystems involved
for all co-ordination and resource allocation tasks.

We discuss the requirements a local scheduler should meet in order to be suitable
for global scheduling. In addition, we describe the basic algorithm of a prototype
MeSch metascheduler, which co-ordinates the whole scheduling process during the
application lifetime, including resource allocation. The algorithm was designed
especially to allow simultaneous access to the requested resources, a requirement
typical of parallel applications.



2 State of the Art 

Until now, the only way to overcome these problems has been to use scheduling
systems that are able to handle resource management completely for all resources
involved. However, using heterogeneous environments as they are, as a regular service
and without any need to change local administration rules and policies, becomes
difficult if such attempts are based on a single task approach. An alternative is to
introduce local components, like the GRAMs of the Globus [6] system, which build an
additional encapsulating layer that interfaces to the local resource management
systems. However, this approach implies an “overhead” which may not be desirable for
smaller computing centers.

We are well aware that there are other powerful systems like Globus, Legion or 
Unicore [6,7,5] providing a broader range of integrated tools. These projects are part 
of the foundations of the Grid and will be propagated more and more as the Grid 
evolves. 

MeSch, however, aims at a simple and efficient way of bundling distributed
computing resources for the “bigger” parallel jobs of a user without the need to
install one of the systems mentioned above.



3 Requirements for Global Resource Management 



3.1 MeSch Scheduler Hierarchy 

The MeSch approach handles resource allocation as a global task which can be
divided into subtasks that may be delegated to co-operating schedulers of the
subcomponents of a Grid environment. Ideally, we do not discard local schedulers;
instead, we build the metascheduler on top of the local ones. This allows us to build a
hierarchy of schedulers.

In the same sense as a traditional scheduler maintains nodes/processors as
allocatable resources, the metascheduler maintains systems (or partitions of systems).
The advantage is that all subsystems can act in their usual way with their own policy.
Moreover, the allocation of processors remains the responsibility of the local system
level and is not done explicitly by the metascheduler. As subsystems remain
responsible for allocation, the local use of subsystems is not affected. No restriction is
imposed on local scheduling strategies and administrative policies.

The MeSch approach does not impose any restrictions on the type of Grid system:
it may be homogeneous or heterogeneous, geographically distributed, or any
combination of MPP, cluster, and dedicated systems.

Fig. 1. Meta-Scheduler as a Hierarchy of Schedulers



3.2 Requirements for Local Schedulers 

However, MeSch requires some local scheduler attributes in order to be able to take
over the burden of the global synchronization of the overall scheduling task.

In order to provide simultaneous access to the required resources, methods of getting
reliable information about suitable time slots must be available. This information
enables MeSch to determine a common time slot on all Grid components that are
required for a Grid application. In general, the first subscheduler suggestions about
available time slots will not lead to a solution for the complete metajob. Thus, we must
be able to ask for alternative time slots to have a chance to determine a commonly
agreed time slot for simultaneous access.
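
As an illustration of this search, the following sketch assumes that each subscheduler
reports its free time slots as a sorted list of non-overlapping intervals, in minutes
from now. The names Slot and CommonSlotFinder are hypothetical and not part of MeSch;
the sketch only shows one way a common start time could be derived from such offers.

```java
import java.util.List;
import java.util.OptionalLong;

/** A free time slot [start, end) in minutes; hypothetical type for illustration. */
final class Slot {
    final long start;
    final long end;

    Slot(long start, long end) {
        this.start = start;
        this.end = end;
    }
}

final class CommonSlotFinder {
    /**
     * Returns the earliest time at which every system has a free slot
     * covering [time, time + duration), or empty if no such time exists.
     * Each inner list must be sorted by start time and non-overlapping.
     */
    static OptionalLong earliestCommonStart(List<List<Slot>> offers, long duration) {
        if (offers.isEmpty()) {
            return OptionalLong.empty();
        }
        long candidate = 0;
        boolean moved = true;
        while (moved) {
            moved = false;
            for (List<Slot> systemSlots : offers) {
                OptionalLong fit = earliestFit(systemSlots, candidate, duration);
                if (!fit.isPresent()) {
                    return OptionalLong.empty();   // no usable slot left on this system
                }
                if (fit.getAsLong() > candidate) {
                    candidate = fit.getAsLong();   // the others must be asked again
                    moved = true;
                }
            }
        }
        return OptionalLong.of(candidate);
    }

    /** Earliest start >= notBefore at which one system can run the job. */
    static OptionalLong earliestFit(List<Slot> slots, long notBefore, long duration) {
        for (Slot s : slots) {
            long start = Math.max(s.start, notBefore);
            if (start + duration <= s.end) {
                return OptionalLong.of(start);
            }
        }
        return OptionalLong.empty();
    }
}
```

With the offers {[0,100), [200,900)} and {[50,400), [500,900)} and a duration of 300
minutes, the search settles on start time 500: the first answers (200 and 500) do not
coincide, so the first system has to be asked for an alternative.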

If a commonly suitable time slot can be determined, the MeSch metascheduler 
must be able to inform each subscheduler to fix the time slot and to guarantee that it 
will allocate required resources at the agreed start time for the agreed time interval. 

MeSch synchronization management requires several iterations of interaction with
subschedulers to find a suitable time slot. Obviously, offered time slots must be
(pre-)reserved by subschedulers while they are under consideration for suitability. An
allocation agreement protocol eases the synchronization process by defining a set of
states a job may have from a scheduling request to its final execution.

Each subscheduler that can be modified to follow the allocation agreement protocol
for metajob scheduling requests is usable in a MeSch controlled Grid environment. Of
course, if the local scheduler cannot be modified to handle the allocation agreement
protocol, the protocol may be implemented in a wrapper, provided the local scheduler
at least offers mandatory attributes such as the ability to start jobs at a given time and
to estimate future resource allocation.

3.3 Allocation Agreement Protocol

The goal of the allocation agreement protocol is to agree with all local schedulers on
the allocation of local resources simultaneously for the same time interval.

We assume a specialized L-submit operation for each local scheduler, which
accepts requests from the meta M-submit. L-submit knows that incoming requests are
for a metajob, and thus it enables the time agreement protocol. This essentially means
that some additional information, such as state, meta-identifier, etc., has to be
maintained, and that the local scheduler knows about the time agreement protocol for
this job.

M-submit calls the L-submit operation which sets the initial Accepted/Rejected 
state. A local preview will calculate a proposed CouldRun time, which the 
metascheduler will change to ShouldRun as a result of analyzing all meta 
components. The local scheduler will agree with a WillRun answer. 

The ReadyToRun state, set by the local scheduler, indicates that local allocation is 
being prepared. The local resources are allocated, once the metascheduler has found 
common agreement by indicating the Go state. 

If the analysis of offered time slots does not yield a solution, the metascheduler
will go back in the protocol line and make better proposals. Likewise, if the local
schedulers cannot fulfill a metascheduler request for a dedicated time slot, they make
new proposals according to their local scheduling policy.
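
The state sequence described above can be written down as a small state machine. The
transition table below is an assumption read off the text (the backward transitions
model the metascheduler or a local scheduler making a new proposal); the names
JobState and AgreementProtocol are hypothetical.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

/** Protocol states of one metajob part, named after the states in the text. */
enum JobState { ACCEPTED, REJECTED, COULD_RUN, SHOULD_RUN, WILL_RUN, READY_TO_RUN, GO }

final class AgreementProtocol {
    // Assumed transition table; not taken from the MeSch implementation.
    private static final Map<JobState, EnumSet<JobState>> NEXT =
            new EnumMap<>(JobState.class);
    static {
        NEXT.put(JobState.ACCEPTED, EnumSet.of(JobState.COULD_RUN));
        NEXT.put(JobState.REJECTED, EnumSet.noneOf(JobState.class));
        NEXT.put(JobState.COULD_RUN, EnumSet.of(JobState.SHOULD_RUN));
        // The local scheduler agrees (WillRun) or counters with a new CouldRun time.
        NEXT.put(JobState.SHOULD_RUN, EnumSet.of(JobState.WILL_RUN, JobState.COULD_RUN));
        // The metascheduler may go back and propose a new ShouldRun time.
        NEXT.put(JobState.WILL_RUN, EnumSet.of(JobState.READY_TO_RUN, JobState.SHOULD_RUN));
        NEXT.put(JobState.READY_TO_RUN, EnumSet.of(JobState.GO));
        NEXT.put(JobState.GO, EnumSet.noneOf(JobState.class));
    }

    private JobState state = JobState.ACCEPTED;

    JobState state() {
        return state;
    }

    /** Moves to the target state or fails if the protocol does not allow the step. */
    void advance(JobState target) {
        if (!NEXT.get(state).contains(target)) {
            throw new IllegalStateException(state + " -> " + target + " is not allowed");
        }
        state = target;
    }
}
```

Keeping the allowed transitions in one table makes it easy for a wrapper around a
local scheduler to reject out-of-order messages.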



4 A Prototype Implementation 



4.1 Using EASY as a Modeling Tool 

MeSch has been implemented and tested successfully in the heterogeneous
multi-MPP environment of GMD’s Institute for Scientific Computing and Algorithms.
The available MPP environment allowed us to attack the problem in a heterogeneous
environment without having to deal with the problem of geographical distribution.
Our prototype environment consists of an IBM SP2 with 34 nodes and an NEC Cenju-4
with 64 processors.

Both systems use an enhanced EASY scheduler [1,2], which has been modified to 
fulfill the time agreement protocol. 

The EASY concept is basically built on a “backfill” strategy. Our enhancements
ensure that even a job of low precedence will be started if its requirements do not have
any implications for jobs of higher priority. The idea is to optimize system throughput
by avoiding idle resources. As a side effect of this full backfill strategy, an estimated
start time is available for all skipped jobs. This allowed us to provide a complete job
estimation list, which is available to users and informs them about worst-case
start/stop times.
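
To illustrate how a backfill scheduler can produce an estimated start time for every
queued job, the following simplified sketch gives each job, in priority order, the
earliest start that does not delay any job placed before it. It is not the EASY code;
the names Job and BackfillPreview and the policy of reserving a slot for every job
are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** A queued job; start is filled in by the preview. Hypothetical type. */
final class Job {
    final String name;
    final int nodes;
    final long runtime;   // requested run time in minutes
    long start = -1;      // estimated start time, -1 while unplanned

    Job(String name, int nodes, long runtime) {
        this.name = name;
        this.nodes = nodes;
        this.runtime = runtime;
    }
}

final class BackfillPreview {
    /** Assigns an estimated start time to every job; queue is in priority order. */
    static void plan(List<Job> queue, int totalNodes) {
        List<Job> placed = new ArrayList<>();
        for (Job job : queue) {
            if (job.nodes > totalNodes) {
                throw new IllegalArgumentException(job.name + " needs too many nodes");
            }
            // Free capacity only grows when a placed job ends, so it is
            // enough to try time 0 and the end time of every placed job.
            long best = fits(job, 0, placed, totalNodes) ? 0 : Long.MAX_VALUE;
            for (Job p : placed) {
                long t = p.start + p.runtime;
                if (t < best && fits(job, t, placed, totalNodes)) {
                    best = t;
                }
            }
            job.start = best;
            placed.add(job);
        }
    }

    /** True if the job can run in [t, t + runtime) next to all placed jobs. */
    static boolean fits(Job job, long t, List<Job> placed, int totalNodes) {
        if (usedAt(t, placed) + job.nodes > totalNodes) {
            return false;
        }
        // Usage can only rise where another placed job starts.
        for (Job p : placed) {
            if (p.start > t && p.start < t + job.runtime
                    && usedAt(p.start, placed) + job.nodes > totalNodes) {
                return false;
            }
        }
        return true;
    }

    /** Number of nodes occupied by placed jobs at the given time. */
    static int usedAt(long time, List<Job> placed) {
        int used = 0;
        for (Job p : placed) {
            if (p.start <= time && time < p.start + p.runtime) {
                used += p.nodes;
            }
        }
        return used;
    }
}
```

On a 34-node system with jobs A (20 nodes, 100 minutes), B (30 nodes, 50 minutes), and
C (10 nodes, 80 minutes) queued in that order, B has to wait until time 100, while C is
backfilled at time 0 because it finishes before B starts.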

With this preview feature for each job, we have one basic property necessary to
build MeSch: it enables MeSch to get current information about scheduled time slots.
In addition, an EASY job may be scheduled with a StartAt option.

For our prototype MeSch, EASY as a local scheduler fulfills the basic
requirements for information gathering. The allocation algorithm had to be changed to
support the time allocation algorithm. For MeSch itself, in addition to new submit and
release operations, only a previewer had to be implemented, which keeps control of
the states of all local schedulers involved. The prototype allows submission of jobs in
an EASY-like way.



5 An Example 

In our prototype implementation, meta components of the environment are specified
explicitly by referencing the number of nodes of the respective systems, as in the
example

msubmit -n70 -t300 -rSysA[30],SysB[40] -bmyjob

The batch job myjob requires 300 minutes of CPU time, 30 nodes of SysA, and
40 nodes of SysB. MeSch separates the command above into the partial submissions

SysA: lsubmit -n30 -t300 -bmyjob

SysB: lsubmit -n40 -t300 -bmyjob

Each of the local systems will handle the respective request according to its local
policy. Backfilling makes it possible to find a worst-case time slot. Following the time
agreement protocol, MeSch is able to accept proposals from the local schedulers or to
make new proposals until all schedulers involved agree on a common time slot.
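
The separation of a metajob request into partial submissions can be sketched as
follows. The parsing rules are assumptions based only on the example above (a resource
list of the form System[nodes], separated by commas); the class MetaSubmitSplitter is
hypothetical and does not reproduce the actual msubmit implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaSubmitSplitter {
    // One resource entry such as "SysA[30]"; surrounding blanks are tolerated.
    private static final Pattern PART =
            Pattern.compile("\\s*(\\w+)\\s*\\[\\s*(\\d+)\\s*\\]\\s*");

    /**
     * Turns a resource list such as "SysA[30],SysB[40]" into one local
     * submission line per system, keeping the time limit and job name.
     */
    static List<String> split(String resources, int minutes, String job) {
        List<String> partial = new ArrayList<>();
        for (String part : resources.split(",")) {
            Matcher m = PART.matcher(part);
            if (!m.matches()) {
                throw new IllegalArgumentException("bad resource spec: " + part);
            }
            partial.add(m.group(1) + ": lsubmit -n" + m.group(2)
                    + " -t" + minutes + " -b" + job);
        }
        return partial;
    }
}
```

Calling split("SysA[30],SysB[40]", 300, "myjob") yields the two partial submissions of
the example, one for SysA with 30 nodes and one for SysB with 40 nodes.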

As an example of the time agreement protocol, assume SysB accepting the metajob 
request. Refer to figure 2 for an overview of the state names and sequence. 

With the EASY internal preview facility, the local scheduler of SysB signals a
CouldRunAt time interval starting at time ts. The scheduler assures availability of
the requested resources to the metasystem. With the ts information from all local
schedulers, the MeSch scheduler can determine the start time max(ts_i). A
ShouldRunAt proposal to the local scheduler in general signals that, for the sake of
simultaneous access, MeSch favors a later time than ts. A local scheduler may
reject the request, proposing a new time slot and signalling the CouldRunAt state, or it
may accept the new time and signal a WillRunAt state for the respective part of the
metajob. This iterative procedure leads to a commonly accepted WillRunAt state for
each part of the metajob.
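
The iteration can be sketched as a loop in which the metascheduler repeatedly proposes
the latest of the local answers until no scheduler moves any more. The interface
LocalScheduler and the method couldRunAt are hypothetical stand-ins for the CouldRunAt
answer of a local scheduler; reservation, ReadyToRun, and Go handling are left out.

```java
import java.util.List;

/** Hypothetical view of a local scheduler taking part in the time agreement. */
interface LocalScheduler {
    /** Earliest start time >= notBefore that this scheduler can guarantee. */
    long couldRunAt(long notBefore);
}

final class TimeAgreement {
    /**
     * Returns a start time accepted by all local schedulers.
     * Each round corresponds to one ShouldRunAt proposal of the metascheduler.
     */
    static long negotiate(List<LocalScheduler> locals, int maxRounds) {
        long proposal = 0;
        for (int round = 0; round < maxRounds; round++) {
            long latest = proposal;
            for (LocalScheduler s : locals) {
                latest = Math.max(latest, s.couldRunAt(proposal));
            }
            if (latest == proposal) {
                return proposal;        // every scheduler can answer WillRunAt
            }
            proposal = latest;          // next ShouldRunAt is max(ts_i)
        }
        throw new IllegalStateException("no agreement after " + maxRounds + " rounds");
    }
}
```

The round limit is a design choice of the sketch: it keeps the loop from running
forever if the local schedulers keep shifting their offers.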

On arrival of the agreed time slot, the local scheduler will signal ReadyToRun to
MeSch, which - if everything is right - will allow local scheduling to allocate the
required resources immediately by signalling a Go state.

Each of these states may be rejected. Whenever the metascheduler is not able to
accept a local scheduler proposal, it resets to the WillRunAt state, proposing the best
time slot according to its knowledge of all local scheduler proposals. Whenever a local
scheduler is not able to accept a MeSch proposal, it resets to the CouldRunAt state,
proposing the next available time slot for the metajob part that can be guaranteed
according to the local scheduling strategy and administrative policy.

6 Conclusion

MeSch is a prototype metajob scheduler for a Grid environment. Its main advantage
is that local scheduling policies are not affected by Grid jobs. Metajob scheduling can
be viewed as using local schedulers as resource managers in a scheduler hierarchy.
However, for an easy-to-implement allocation agreement protocol, local schedulers
must provide a run time estimation facility for submitted jobs and must accept and
guarantee dedicated start time specifications.

The practicability of the approach has been demonstrated by a prototype 
implementation based on an enhanced EASY scheduler version. 

Currently we are investigating how to implement scheduling for visualization 
devices such as a workbench for applications with real-time visualization demand. 

Abstract. The requirements of image processing applications (IPA) can best be
met by using a distributed environment. The authors had developed an
environment for distributed image processing over a network of VAX/VMS and
Unix systems. The efficiency was as high as 90-95%. This paper presents an
augmentation and generalization of that environment using Java and web
technology to make it truly system independent. Although the environment has
been tested using image processing applications, the design and architecture
are general, so that it can be used for other applications which require
distributed processing.

Keywords: DEDIP, Parallel Image Processing, Distributed image processing 



1. Introduction 

Image processing applications (IPA) require processing on large volumes of data.
These also require various types of resources like high-resolution graphic displays, 
drives for magnetic tape/cartridge/floppies, optical disk, database, etc. The resource 
requirement changes with time due to the availability of new and better resources. 
Hence, it is not possible to assume the availability of all resources on a single system. 
The distributed processing environment not only helps in optimum utilization of such 
resources but also helps in achieving better throughput using multiple processors in 
parallel. However, if one has to use multiple heterogeneous machines in a network to 
execute a set of tasks, one may have to execute a tedious set of commands. It is not 
possible to expect an operator to carry out such operations on a regular basis. 
Moreover in such a system, any error, occurring either during data transfer or during 
processing, creates difficulties for the operator. 

Development of an application having built-in automated data transfer, capability 
of using multiple machines in parallel, and robust error handling is a challenging job. 
This paper presents the authors’ contribution in providing a tool that makes such a 
development very easy. 

The research work carried out by eminent computer professionals [1-10] focused
on the needs of parallel-processing experts. Image processing applications, however,
are developed by scientists (mathematicians, physicists, remote sensing experts, etc.),
not by parallel processing experts. For scientists, in contrast to parallel processing
experts, a smooth operational environment, operational setup, and ease of use are the
critical issues.

We focused on the requirements of this vast community. We presented a
full-fledged Development Environment for Distributed Image Processing (DEDIP) that
makes the development and operationalization of distributed applications very easy
[11]. This paper presents WebDedip, a redesign and generalization of DEDIP that
makes it more user friendly and truly heterogeneous.



2. Generalization and Extension 

WebDedip has a novel design, which applies object-oriented modeling techniques in
the web domain. The new model uses a three-tier architecture instead of a
master-slave one. DEDIP provided a GUI only on the host system. WebDedip provides a
browser-based GUI on all nodes connected to the system. It enables the user to use
WebDedip from any system on the Internet. Thus, it provides a roaming profile to
application designers, operation managers, and operators. In DEDIP, users have to
edit a few text files to configure their application. WebDedip has made this task
easier by adding new user-friendly GUIs. The augmented design addresses all the
important redundancy issues, making WebDedip fault tolerant.



3. WebDedip Overview 

WebDedip has a three-tier architecture: GUI, DedipServer, and agents, as shown in
figure-1.

The GUI is the web-enabled graphical user interface that makes the entire user
interaction truly system independent. It supports various Java applets for application
configuration, application building, application operation initiation, application
progress monitoring, and session controlling. The user initiates the interaction by
visiting a predefined site using a standard browser. The standard web server loads the
required GUI on the web browser.

WebDedip has a back-end DedipServer running on the web site. When the GUI
submits a request to the DedipServer, the DedipServer reads the application
configuration information from the configuration file. It then initiates the execution
of the first process in the interdependency chart. Normally, most applications have a
single starting process. If an application has multiple starting processes, the
DedipServer initiates execution of all such independent processes. It informs the
agent(s) on the target node to start the execution of the process. The agent sends the
status information back to the DedipServer when the process is completed. On the
successful completion of a process, the DedipServer finds the dependent processes and
initiates the execution of each such process. The required files are transferred from
one node to another. WebDedip has a callable library in Java to interface with the FTP
server [12] that helps in transferring files. The required process is automatically
inserted in the configuration when the IP designer inserts the IO dependency
information (figure-5) between two processes.
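
The way the DedipServer walks through the interdependency chart can be sketched as
follows. The class InterdependencyChart is hypothetical; it assumes that every started
process completes successfully and ignores nodes, agents, and file transfer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class InterdependencyChart {
    // Process name -> names of the processes it depends on.
    private final Map<String, Set<String>> predecessors = new LinkedHashMap<>();

    void addProcess(String name, String... dependsOn) {
        predecessors.put(name, new LinkedHashSet<>(Arrays.asList(dependsOn)));
    }

    /** Processes not yet started whose predecessors have all completed. */
    List<String> ready(Set<String> completed, Set<String> started) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : predecessors.entrySet()) {
            if (!started.contains(e.getKey()) && completed.containsAll(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Simulates a run and returns the groups of processes started together. */
    List<List<String>> simulate() {
        Set<String> completed = new HashSet<>();
        Set<String> started = new HashSet<>();
        List<List<String>> waves = new ArrayList<>();
        List<String> wave = ready(completed, started);
        while (!wave.isEmpty()) {
            started.addAll(wave);
            completed.addAll(wave);     // assumption: every process succeeds
            waves.add(wave);
            wave = ready(completed, started);
        }
        return waves;
    }
}
```

For a chart in which P2 and P3 depend on P1 and P4 depends on both, the simulation
starts P1 first, then P2 and P3 together, then P4.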

The DedipServer stores complete information about all the applications configured 
on the web site. The DedipServer exchanges information with the 
DedipBackupServer making the model fault tolerant. 




Fig. 1. WebDedip Model 



The task of the agent is very simple. It accepts requests from the DedipServer, 
executes them and provides the status information when completed. It has process 
building (compilation), execution, and monitoring capabilities. It can schedule 
multiple processes in parallel. It does not control the synchronization among the
parallel processes; instead, it depends on the DedipServer for this job. It treats each
process as a single independent entity.

The WebDedip not only caters to the requirements of the application designer, but 
also addresses all the requirements of the operation manager as well as operators. The 
application configuration and building is a privileged task, carried out either by the 
application designer or operation manager. During the regular operations, the operator 
can initiate any required application, monitor progress, do error handling, and 
terminate the application, if necessary. The web server capability is used to provide 
the required access control rights. 

Object-oriented modeling (implemented in Java) is used for the design of
WebDedip [13]. The application is modeled as an object, while the process is modeled
as an embedded object. The object interlinking capability is used to maintain
interdependency information for an application. The Java distributed object
architecture is used along with object serialization for network communication among
the GUI, DedipServer, and agents. Hence, WebDedip can be used on a LAN, a WAN, or
the Internet. Agents may run on any system over the Internet. On start, an agent makes
a connection with the DedipServer on a predefined port and volunteers for computation
workload. Java object persistence is used in storing the information, including
dynamic information. The same mechanism is used in communication among the GUI,
DedipServer and agents.
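
As a minimal illustration of object serialization as a message format, the sketch below
round-trips a status message through a byte array, which stands in for the network
connection. StatusMessage and Wire are hypothetical names; the actual WebDedip message
classes are not described here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical status message an agent might send when a process ends. */
final class StatusMessage implements Serializable {
    private static final long serialVersionUID = 1L;
    final String process;
    final int exitCode;

    StatusMessage(String process, int exitCode) {
        this.process = process;
        this.exitCode = exitCode;
    }
}

final class Wire {
    /** Serializes a message into bytes, as it would be written to a socket. */
    static byte[] encode(Serializable message) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Restores the object on the receiving side. */
    static Object decode(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same encoding can be written to a file instead of a socket, which is why a single
mechanism can serve both persistence and communication.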

The Windows-explorer is used as a metaphor in developing the navigation GUI 
due to its popularity and ease of use (see figure-2). 



3.1 Application Configuration 

The application designer first decides the configuration of his application. It depends
on the distributed resource requirement, the parallel processing requirement, the
input/output of each process, etc. WebDedip provides a GUI for this purpose, as shown
in figures 2-5. Figure-2 shows the overview of the application. A typical
interdependency chart, generated interactively, is shown in figure-3. The detailed
information for a typical process is shown in figure-4. The line joining two processes
shows their interdependency in a top-down model. The IO dependency, if any, is a part
of this interdependency and can be configured easily. A typical configuration is shown
in figure-5. The user can modify his application configuration file at any time. The
modification takes effect on the next execution of the application.




Fig. 2 Basic Information of the Application 

Fig. 3 Process interdependency Information




Fig. 4 Process Information 




Fig. 5 Data dependency information 

3.2 Application Building

An application consists of many processes. All these processes need to be compiled
on the target node. WebDedip automates all these compilations. The configuration
information has all the required details about each process. The DedipServer copies
the source code and make-file required to build a process to a predefined temporary
area on the target node. It then requests the agent on the node to build the process
using the make-file. It carries out this task for each process given in the
configuration. The agent creates a designated directory and preserves the executable
in it. The application designer can build the processes externally on all systems in
case he is not willing to give the code. The GUI provides the necessary support for
such an external readiness indicator.

3.3 Application Execution and Monitoring 

The operator can start execution of any application from any machine on the net using
a standard browser. The GUI displays the configured applications to the operator for
selection. The operator can start/abort/suspend/resume an application. Figures 6 & 7
show the GUI for session and application progress information.




Fig 6 Session progress information 

3.4 Error Handling 

In case of abnormal completion, the DedipServer displays the error message with the
error code to the operator. These error codes and error messages are provided by the
application designer. WebDedip keeps this information in the configuration file. The
operator can restart the process after taking the necessary actions. In addition, the
operator has the options of either restarting the entire application or aborting it.

Fig 7 Application progress information



3.5 Session Management 

Each time an operator logs in, the DEDIP scheduler starts/restarts a session for him.
Each session has a unique session identification number. The scheduler keeps all the
information about the session on the server. The operator has multiple options when
logging out. He can close the session, terminate the session, suspend/resume the
session, or submit the session for progress in the background before logging out. He
can close the session only after normal completion of all the requests he has
submitted. He can terminate the session immediately in case of emergency. In case of
termination, WebDedip kills all the processes of all the requests submitted by the
operator, irrespective of their status. Background processing is very effective in the
case of non-interactive processes. WebDedip gives the detailed status to the operator
at the next logon.



3.6 WebDedip System Management 

The WebDedip system consists of a DedipServer and agents. The DedipServer can
detect agent termination. It displays the message on the operators’ console as well as
on the operation manager’s console.

The DedipServer is the most important process in the entire system. Its failure, for
example due to a system crash, can cause a severe problem. The DedipBackupServer is
designed to handle the failure of the DedipServer. The DedipBackupServer software
package runs on the machine of the backup server and duplicates the required
information from the DedipServer. An agent sends a trigger to the DedipBackupServer
when it fails to communicate with the DedipServer. The DedipBackupServer validates
the DedipServer failure. It takes over the complete responsibility from that moment
onwards and informs the operation manager. The servers exchange information only on
external events such as the termination of a process, the start of a new process, the
initiation of an application by the operator, the start of a new session, etc. The
frequency of such events is very low. Furthermore, the volume of the information is
negligible. Hence, the communication overhead for maintaining the back-up server is
very low.
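
The takeover logic can be sketched as follows. This is only an illustration under
assumptions: BackupServer, mirror, onAgentTrigger, and the primaryAlive probe are
hypothetical names, and how the DedipBackupServer actually validates a failure is not
described in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

final class BackupServer {
    private final List<String> mirroredEvents = new ArrayList<>();
    private final BooleanSupplier primaryAlive;   // hypothetical health probe
    private boolean active = false;

    BackupServer(BooleanSupplier primaryAlive) {
        this.primaryAlive = primaryAlive;
    }

    /** Called by the primary on external events only (process end, new session, ...). */
    void mirror(String event) {
        mirroredEvents.add(event);
    }

    /**
     * Called by an agent that could not reach the primary. The backup checks
     * the primary itself before taking over, so a single agent's network
     * problem does not cause a takeover.
     */
    boolean onAgentTrigger() {
        if (!active && !primaryAlive.getAsBoolean()) {
            active = true;
        }
        return active;
    }

    boolean isActive() {
        return active;
    }

    /** Copy of the state duplicated from the primary so far. */
    List<String> state() {
        return new ArrayList<>(mirroredEvents);
    }
}
```

Because only event records are mirrored, the backup holds everything it needs at the
moment of takeover without continuous polling of the primary.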



4. Case Study 

WebDedip functionality and efficiency were tested using Microsoft Windows NT as host
and IRIS workstations as slaves. IIS 4 was used as the web server. The front-end GUI
was tested on the two most popular browsers, IE and Netscape.

WebDedip was tested for three cases [11] using simulated executables by three
operators in ten runs. The simulated processes were generated to resemble actual
processes for image processing interaction/processing. The process dependency charts
are given in figures 8-10. The processing node is shown in brackets if it is different
from the host. DTHS stands for Data Transfer from Host to Slave, while DTSH stands for
the reverse process. ‘T’ indicates that the process requires the tape unit. ‘W1’ and
‘W2’ indicate that the process is scheduled on workstation1 and workstation2
respectively. The time (in minutes) required by each process is shown in brackets.



P1        P2      P3      DTHS    P4         P5         DTSH    P6      P7

(3,T) --> (3) --> (4) --> (4) --> (2,W1) --> (5,W1) --> (4) --> (3) --> (2,T)



Fig. 8. Simulated case-1 for testing 

Fig. 9. Simulated case-2 for testing

Case 1: A single package requiring sequential scheduling, the simplest case, is
shown in figure 8.



Case 2: Single package requiring parallel scheduling is shown in figure 9. 



Case 3: Parallel execution of two packages, each package requiring sequential 
scheduling, is shown in figure 10. 



Table 1: Results for the case studies (time in minutes)

Case   Theoretical   Without Dedip   WebDedip
1      30.0          32              32.0
2      23.5          52              25.0
3      42.0          74              46.0



The efficiency results are almost the same as those achieved in the earlier version,
i.e., 90-95%. The page and applet loading time over the network is excluded. The
additional time in the case of WebDedip is mainly due to two factors: (1) action
communication delay and (2) DedipServer overheads. The action communication delay
was measured for various actions by repeated exercises. It was found to be
approximately 10 to 40 seconds for these types of action. The remainder is
DedipServer overhead.
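
The figures in Table 1 can be checked with a short calculation. The sketch assumes
that efficiency is the theoretical time divided by the measured time, a definition the
text does not state explicitly but which is consistent with the reported 90-95%; the
class name is hypothetical.

```java
final class CaseStudyEfficiency {
    /** Efficiency in percent, assumed to be theoretical time over measured time. */
    static double efficiencyPercent(double theoreticalMinutes, double measuredMinutes) {
        return 100.0 * theoreticalMinutes / measuredMinutes;
    }

    /** Theoretical time of a purely sequential chain: the sum of the process times. */
    static double sequentialTime(double... processMinutes) {
        double total = 0;
        for (double minutes : processMinutes) {
            total += minutes;
        }
        return total;
    }
}
```

With the values of Table 1, this gives 93.75% for case 1 (30.0 / 32.0), 94.0% for case 2
(23.5 / 25.0), and about 91.3% for case 3 (42.0 / 46.0). The process times of figure 8
(3, 3, 4, 4, 2, 5, 4, 3, 2) sum to the theoretical 30 minutes of case 1.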

Recently, a few scientists were engaged in developing web-based project
management and work flow applications [14] such as hierarchical progress reporting
and compilation, meeting management, project task management, personal task
management, document authentication, resource booking, complaint management, job
work flow, and remote system configuration detection. These applications had
distributed processing requirements among the browser-based remote machines, web
server, database server, and mail server. The scientists used WebDedip instead of Java
servlets. Their own GUIs could communicate with the DedipServer and agents using
RequestObject, a message passing object from the WebDedip library [13].

5. Related Work

In this section, we summarize the research efforts that are closely related to our work.
JPVM [1] and Java MPI [2] are the Java extensions of PVM and MPI respectively.
JavaParty [3], ParaWeb [4], Charlotte [5], Popcorn [6], and Javelin [7] are Java-based
systems for distributed computing. JavaParty provides mechanisms for transparently
distributing remote objects. ParaWeb is an implementation of the JVM that allows Java
threads to be transparently executed remotely. Charlotte provides a high-level solution
that decouples the programming environment from the execution environment. Its
disadvantage is that the programmer does not have explicit control over resource
utilization. However, its eager scheduling enables the runtime system to provide load
balancing efficiently. Popcorn provides a Java API for writing parallel programs for
Internet distribution. Applications are decomposed by the programmer into small,
self-contained subcomputations called computelets. Popcorn is based on a buyer-seller
concept. It has a centralized entity called the market that determines which CPU
seller executes the computelet. Javelin is an infrastructure for Internet-based parallel
computing. Any free computer system can volunteer to execute a task using the
applets supported by Javelin. It follows a client-broker-server architecture.
Bayanihan [8] and Ninflet [9] are also very similar to Javelin.

The methods reported in most of the above concentrate on providing computation
power to a large and complex application efficiently. All of them expect efficient
parallel and distributed programming skills. Their definition of ease of use centers on
application compilation, scalability, load balancing, fault tolerance, etc. WebFlow [10]
is closest to our work. It provides a Java-Swing based visual programming
environment for metacomputing using Java. It supports the Globus [15] metacomputing
toolkit at the backend.

The programmer needs to use Java as the metacomputing language in the above
models. Furthermore, the GUIs of the above models support the monitoring and
controlling of a single (large and complex) application in stand-alone mode. Therefore,
they do not require an elegant and easy GUI for simultaneous execution, monitoring,
and controlling of multiple applications.

WebDedip (and DEDIP) concentrated on the vast community of scientists rather
than efficient programmers. It made distributed application development very easy. It
supports all languages, such as Fortran, C, C++, and Java. Its GUI supports all the
needs of an operational environment executing multiple heterogeneous applications
simultaneously. It has its own backend support for process scheduling and
monitoring.



6. Conclusion 

WebDedip provides a useful facility for the designer to develop distributed image 
processing applications in a user-friendly environment. The browser-based GUI lets 
the designer use the system functionality from anywhere over the Internet. The 
graphical user interface makes it easy to visualize and configure the application. 
Furthermore, WebDedip addresses all the critical elements for smooth operations. The 
option of back-up server support makes the entire system robust. 

The results obtained from the simulated test cases for WebDedip match those of the 
earlier version. The communication delay over the network is the only additional 
delay. The earlier version of the model was used by 15 scientists to develop and put 
into operation 10 distributed image processing applications for the Indian Remote 
Sensing Satellite (IRS). It is likely to be replaced by the new WebDedip. 

Although WebDedip has been tailored to the requirements of image processing 
applications, its design and architecture are general, so it may also be used for 
other applications. A collaboration is being worked out with Nirma Institute of 
Technology to use WebDedip in the field of advanced computing for civil engineering. 

A study is being carried out on interfacing WebDedip with PVM. 

Abstract. The creation of parameter study suites has recently become a more 
challenging problem as the parameter studies have become multi-tiered and the 
computational environment has become a supercomputer grid. The parameter 
spaces are vast, the individual problem sizes are getting larger, and researchers are 
seeking to combine several successive stages of parameterization and computation. 
Simultaneously, grid-based computing offers immense resource opportunities but 
at the expense of great difficulty of use. We present ILab, an advanced graphical 
user interface approach to this problem. Our novel strategy stresses intuitive visual 
design tools for parameter study creation and complex process specification, and 
also offers programming-free access to grid-based supercomputer resources and 
process automation. 



1 Motivation and Background 

Only a decade ago, the solution of the partial differential equations required for the 
evaluation of aerospace vehicle flow-fields typically involved a single discretization 
zone and was performed on a single processor of a high-speed compute engine that was 
usually situated locally. These compute tasks were so costly in CPU cycles that the notion 
of performing parameter studies was usually ignored. Now, however, the flow-solvers 
are typically parallel codes. The compute engines are frequently large parallel machines 
with multi-gigabyte memories and terabyte disk farms. Researchers have available the 
resources not only of their own laboratories but also those at other computer centers 
accessible via fast networks. Parameter studies are now quite feasible and are being 
performed on a regular basis by researchers who require solution information throughout 
a given aerospace vehicle flight regime. The difficulties, however, have shifted to the 
manual creation of these parameter studies and to tasks associated with launching and 
managing the large number of jobs required by these studies. Modern aerospace flow- 
solvers frequently require large sets of discretization grids which describe the geometry 
of the aerospace vehicle. They produce as output large collections of data files. Currently, 
most parameter studies are performed with two-dimensional flow solvers, but three- 
dimensional solvers are also beginning to be used. 

Recent developments in grid-based “metacomputing” such as Globus [1] and Legion [2] 
have created opportunities for running parameter studies on remote networked 
high-performance compute servers which constitute a shared resource for participants. 
But these opportunities come at a price: the proliferation of job control language (JCL) 
to support these capabilities. This has placed an onus on users of these metacomputing 
grids, who are typically engineers and researchers not well prepared for, or enthusiastic 
about, learning or creating the requisite control language scripts for managing distributed 
parameter studies. NASA is currently building a national metacomputing infrastructure, 
called the “Information Power Grid” (IPG) [3], intended to provide ubiquitous and 
uniform access to a wide range of computational, communication, data analysis, and 
storage resources, many of which are specialized and cannot be replicated at all user 
sites. However, the interface to the IPG is still under development. 

We briefly describe the notion of a “parameter study” by giving two general examples. 
Simulation codes produce solutions to scientific or engineering problems for some 
set of input values (“parameters”). Varying these parameters through some prescribed 
range (the “parameter space”) yields a set of related results, called a “parameter study” 
(sometimes written as “parametric study”). As a second example we point to Monte Carlo 
simulations. Monte Carlo codes are typically run many times in order to produce 
statistically meaningful ensemble averages. This too can be considered a parameter study, 
where the parameter to be varied is merely the seed for random number generation, and 
does not actually have any physical significance. 

The end product of creating and launching parameter studies is typically a large suite 
of result files which must be postprocessed and/or moved to some form of long-term 
storage. Furthermore, parameter study users must be able to keep track of these results 
and log into a scientific diary such particulars as nature of the solved problem, location 
of the result files, history for the individual runs, and any other associated information. 
Being able to easily recreate and then modify the parameter study is also an important 
need for many users. 

We conducted a literature survey to identify existing parameter study capabilities 
that fulfilled the need of users at NASA Ames Research Center. The only tools deemed 
applicable for these tasks were the historically related Clustor and Nimrod codes [5]. 
Both are able to generate and launch simple parameter studies. They also implement 
an internal “meta-language” for describing parameter study creation. Additionally, they 
make it easy to parameterize command line arguments. 

However, they did not fully meet the requirements of our users. Some of these are as 
follows. Users must have access to multiple job submission environments. These must 
include any combination of PBS [6], LSF [7], MPI, Globus, Condor [8], and Legion. 
Also, users require the ability to create what we call “multi-stage” parameter studies (a 
detailed example appears in section 7). Users also need a “fire-and-forget” capability, i.e., 
once the parameter study suite is created, it should be possible to initiate job launching 
and then shut down the parameter study tool entirely. Job submission should continue 
autonomously, and without the continued presence of the parameter study tool. Users 
also require a fairly comprehensive level of job auditing and scientific diary capability, 
the secretarial side of a problem solving environment (PSE). On the development side, 
we needed to design a parameter study tool that could be easily extended using a high- 
level rapid-prototyping language (such as Perl). This is because we envision using the 
tool as a testbed for experiments in parameter study creation models, job submission 
models, and complex process specification models. We also need to be able to use the tool 
to generate shell scripts designed for parameter study job submission and for complex 
process job submission (visual scripting). It is essential that the script generation process 
be very flexible. 



2 Problem Definition 

Creating and launching parameter studies without the assistance of automating tools 
is laborious, tedious, and error-prone. Examining the stages of this task allows us to 
discern the nature of the inherent problems. The first stage is to create the parameterized 
input files which incorporate the sets of values representing the parameter study. These 
sets are the Cartesian product of the individual sets of values over which each of the 
parameters of interest varies. The total number of combinations (the parameter space) 
can quickly get to be very large, and creating these sets of input files manually is time- 
consuming and error-prone. Each of the resultant input files represents a run of the user 
program (a job). Launching jobs involves setting up partitioned file spaces in which 
they can run, supplying each with all required input files, submitting them, and then 
monitoring progress and managing output. Our first design requirement was that all 
of these functions be automated and integrated into a single Graphical User Interface 
(GUI). The second requirement was that of simplicity of use. We believe that users are 
very sensitive to ease-of-use issues, and that they will avoid process automation tools 
that are deemed difficult or non-intuitive. The third requirement is that a parameter study 
tool be able to self-document its actions. If it cannot, users will quickly be mired in a 
morass of hundreds, even thousands, of old runs whose origin and purpose are no longer 
obvious; a complete parameter study tool must be part PSE and part scientific diary. 
The fourth requirement is that of job submission flexibility in a scientific computation 
environment currently in flux. This is because “Grid-based” computing has added new 
complexities and layers of JCL to the task of submitting jobs. ILab meets all four of 
these user requirements. 



3 Basic Assumptions and Requirements for Distributed Processing 



We have started with two basic assumptions about NASA’s distributed computing 
environment into which jobs will be launched. The first is the need to maintain production 
level capability. This has significant implications, because all compute-intensive 
application processing must occur under the aegis of a job scheduler and queuing system. Any 
other manner of submitting to shared computational resources would violate “good 
neighbor” policy. The second assumption about our distributed computing environment is 
that it should be able to leverage the Globus metacomputing middleware currently being 
developed at Argonne National Laboratory. It must also be possible for parameter study 
users to bypass the Globus layer and still submit jobs into a distributed environment. 
This has resulted in a design incorporating several job models for spawning parameter 
studies in a distributed fashion. 



4 ILab: The IPG Virtual Lab 

We describe important features of the ILab parameter study tool, in particular 
parameterization operations and aspects of the internal coding design. 

4.1 Parameterization of Program Input Files 

In order to minimize the difficulty of building a set of parameterized input files, ILab 
includes an integrated, special-purpose text editor. This editor has unusual capabilities: it 
allows the user to mark graphically the appropriate parameter data fields and to designate 
the set of values for each selected field. This parameterizer is depicted in Fig. 1. 

Fig. 1. ILab parameterization screen 

Value sets can be specified either as a list or by min/max/increment. The user first selects 
(highlights) with the mouse those ASCII text fields within the input file which will be 
parameterized. In Fig. 1, “beta” and “reynum” (known to ILab as Parameter1 and Parameter2) 
have already been parameterized; their value sets are displayed in the left window. Currently 
the user is specifying the third parameter in the “Set Param Values” dialog. If several 
fields must be parameterized in tandem (example: multiple occurrences of “timestep” 
for each of several related discretization zone input files), that can be indicated at this 
stage. After text selection of the appropriate fields, the user enters a list or range of values 
for the selected fields. Lastly, the set of parameterized input files is generated. These 
files constitute the Cartesian product of the individual parameter sets. As an example, if 
three input values are to be parameterized (a 3-dimensional parameter space), the first 
with the set of values {1, 2, 3, 4}, the second with {hello, goodbye}, and the third with 
{3.14, 2.718, 1.618}, then a total of 4 x 2 x 3 = 24 parameterized files will be produced. 
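
The Cartesian product in the example above can be sketched in a few lines of Perl. This is 
only an illustration of the counting argument, not ILab's actual code; the function name 
cartesian_product is an assumption made for this sketch. 

```perl
use strict;
use warnings;

# Build the Cartesian product of a list of value sets.
# Each result is an array reference holding one value per parameter.
# (cartesian_product is a hypothetical name, not taken from ILab.)
sub cartesian_product {
    my @sets   = @_;
    my @combos = ( [] );    # start with a single empty combination
    for my $set (@sets) {
        my @next;
        for my $combo (@combos) {
            for my $value (@$set) {
                push @next, [ @$combo, $value ];
            }
        }
        @combos = @next;
    }
    return @combos;
}

# The three value sets from the example in the text.
my @combos = cartesian_product(
    [ 1, 2, 3, 4 ],
    [ 'hello', 'goodbye' ],
    [ 3.14, 2.718, 1.618 ],
);

printf "%d parameterized files\n", scalar @combos;    # 4 x 2 x 3 = 24
print join( ', ', @{ $combos[0] } ), "\n";            # first combination
```

Each array reference in @combos corresponds to one parameterized input file, and hence to 
one job. 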

Because the file parameterizer is integrated within the ILab GUI and because its use 
is intuitive, the process of parameterization of the input files has been made trivial. 
Additionally, a “most-recently-used” (MRU) capability saves the current parameterization 
state for future reference and for reuse or modification. 



4.2 Job Masking Capability 

One of the necessities of a parameter study program is to provide “masking” capability for 
a set of parameterized input files. Users require this ability when they know that certain 
parameter combinations will produce an unsuccessful run of the scientific program under 
consideration. Typically, they want to specify combinations of parameter values that 
will be excluded from the set of input files and their associated script files. ILab’s “Edit 
Parameters” screen - the special purpose editor described in section 4.1 - has a pop-up 
dialog for this purpose. Users can enter any number of masking rules, and each rule 
must specify two or more parameter comparisons. For example, if the user is varying 
Parameter1 from 1 to 10, and Parameter2 from 55 to 75, and wants to exclude those 
combinations where Parameter1 is greater than 9 and Parameter2 equals 60, the masking 
rule would be entered as: 

    Parameter1 > 9 && Parameter2 == 60 

This syntax, which is the same in Perl, C, C++, and Java, was chosen since users are 
likely to be familiar with it. The names Parameter1 and Parameter2 are assigned in order 
by ILab to the values being parameterized. (ILab, of course, has no way of knowing the 
actual names of parameters in the user’s input files since there is no requirement that 
the input have labeled data. In the example in Fig. 1, it just so happens that the user is 
parameterizing a Fortran “namelist” file with labeled fields, but ILab itself only requires 
that the input be ASCII.) By using Perl’s “eval” function, we can easily interpret the 
above rule with minimal parsing, and use it to delete job objects from the user’s list of 
experiments. 
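
A minimal sketch of how such a rule might be evaluated with Perl's “eval” follows. It is 
not ILab's implementation: the function name is_masked, the rewriting of ParameterN into an 
array element, and the integer steps assumed for both ranges are all choices made for this 
illustration. Because the rule is executed as Perl code, this approach is only appropriate 
when the rule comes from a trusted user. 

```perl
use strict;
use warnings;

# Return 1 if the combination (array ref of values) matches the masking rule.
# "ParameterN" in the rule is rewritten to $p[N-1] before evaluation.
# (is_masked is a hypothetical helper, not part of ILab.)
sub is_masked {
    my ( $rule, $values ) = @_;
    my @p = @$values;
    ( my $code = $rule ) =~ s/\bParameter(\d+)\b/'$p[' . ( $1 - 1 ) . ']'/ge;
    my $result = eval $code;    # string eval sees the lexical @p
    die "bad masking rule '$rule': $@" if $@;
    return $result ? 1 : 0;
}

my $rule = 'Parameter1 > 9 && Parameter2 == 60';

# Assumed integer steps: Parameter1 = 1..10, Parameter2 = 55..75.
my @jobs;
for my $p1 ( 1 .. 10 ) {
    for my $p2 ( 55 .. 75 ) {
        push @jobs, [ $p1, $p2 ];
    }
}

# Delete the masked combinations from the list of jobs.
my @kept = grep { !is_masked( $rule, $_ ) } @jobs;
printf "%d of %d combinations kept\n", scalar @kept, scalar @jobs;
```

With these assumed ranges only the combination (10, 60) is excluded. 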



4.3 Coding Model and Language Choice 

We have chosen to build our ILab GUI using Perl5 and the Tk user interface construction 
tool kit. In addition, we have used the Perl generation capabilities of the “SpecTcl” Tk 
GUI generation IDE [9], a free software tool available from Sun Microsystems. Our 
choice of Perl5 was based on its strong character string manipulation and built-in regular 
expression capabilities, strong list and sortable associative-hashtable datatypes, and its 
simple-to-use object-oriented features. Also, Perl is relatively ubiquitous and is amongst 
the fastest interpreters commonly available today. Altogether, these features make Perl 
an excellent choice for true rapid-prototyping. Though we cannot exactly quantify the 
savings in the coding effort, we believe, based on prior experiences, that the equivalent 
functionality would require two to three times as much code in C++ or Java. 

4.4 Object-Oriented Data Structures and Strategies 




We used Perl “packages” (the equivalent of classes in C++ and Java) to hold all ILab 
data, both persistent and transient. Fig. 2 depicts the data structures hierarchy. 

Fig. 2. ILab data structures 

An Experiment package holds all persistent data: the data is serialized (written en masse, 
retaining data structure hierarchies) to and from disk with the use of Perl’s Data::Dumper 
module. To reduce the size of the Experiment package, several other arrays of packages 
apportion data that has to be held in lists or arrays: a ParamFile package for each 
input file to be parameterized, a ParamData package for each variable being parameterized 
in each input file, and a Job package to hold run-specific data. ParamFile and ParamData 
hold file- and variable-specific data while the Experiment is being created and edited. 
To run a user’s parameter study Experiment, a list of Job packages is created: some 
ParamFile and ParamData data is transferred to Job packages, and additional data is 
attached. The organization of data in ParamFile and ParamData is “orthogonal” to the way 
the same data is organized in the array of Job packages: this simplifies script creation, 
submission, and monitoring. Essentially, data is in arrays of arrays during editing/creation, 
while during submission/monitoring the same data is flattened out into a one-dimensional 
array of Job packages. Both sets of data are serialized when an Experiment package is 
serialized. 
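
The serialization and the two “orthogonal” layouts described above can be illustrated with 
a small sketch. It uses plain nested hashes and arrays in place of ILab's packages; all 
field names (Name, ParamFiles, FileName, ParamData, Values) and the file name flow.in are 
assumptions made for this example, and the dump is kept in a string rather than written to 
disk. 

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical, simplified Experiment: arrays of arrays, as during editing.
my $experiment = {
    Name       => 'demo',
    ParamFiles => [
        {
            FileName  => 'flow.in',
            ParamData => [
                { Name => 'Parameter1', Values => [ 0.5, 0.7 ] },
                { Name => 'Parameter2', Values => [ 2, 4, 6 ] },
            ],
        },
    ],
};

# Serialize: Dump emits Perl source of the form "$restored = {...};".
my $text = Data::Dumper->Dump( [$experiment], ['restored'] );

# Deserialize: evaluating that source rebuilds the whole hierarchy.
my $restored;
eval $text;
die "could not restore experiment: $@" if $@;

# Flatten the restored data into a one-dimensional list of jobs,
# as is done for submission and monitoring.
my @jobs;
my ( $first, $second ) = @{ $restored->{ParamFiles}[0]{ParamData} };
for my $v1 ( @{ $first->{Values} } ) {
    for my $v2 ( @{ $second->{Values} } ) {
        push @jobs, { Status => 'NotStarted', Values => [ $v1, $v2 ] };
    }
}
printf "%d jobs\n", scalar @jobs;
```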

Each window or dialog box is also a package, which holds transient data: user 
interface references and data as necessary, and also “mirror” portions of the current 
Experiment data. This duplication of data makes it easier and more robust to edit 
previously entered data, since a user can make changes and then cancel the changes without 
having to restore the original data. Another important advantage is gained from the 
“mirror” and “orthogonal” approaches: the trade-off is more data, less code. Problems in the 
data are easier to find and fix than problems in the code. Debugging is also facilitated by 
the following strategies: (1) each package has a “dump” function to print out all 
variables and (2) each package error-traps the setting of any variable inside the set portion 
of a get/set function. Caveat: data duplication is not a good or dependable strategy, 
unless it is closely integrated with code design. This integration means constantly 
reviewing the data members of packages, and moving data members as appropriate to avoid 
inconsistencies and incoherencies in the package design. 

To keep the code structure simple and intelligible we avoided as much as possible 
the use of inheritance. Some of ILab’s dialog packages are derived from existing Tk 
packages, but this derivation is only one level deep, and fairly transparent. The only 
data structure that requires inheritance is our JobModel package, because (1) we have 
several “job models” already, and they have enough similarities and differences to justify 
the existence of a base class and (2) more derived job models will need to be added in 
the future, as ILab is expanded to accommodate more metacomputing environments. 
The various job models are described in section 5. 

Perl is a highly flexible language. We were able to further simplify our packages 
by giving the package members and the get/set functions the same names, since the 
Perl interpreter distinguishes the variable from the function syntactically; the variable is 
$reference->{name}, and the function $reference->name. Note that Perl already 
makes it easy to collapse get and set functions into one function, so that only one 
function accompanies each data member. 

Here is an example of this approach in our Job package, showing the new (constructor) 
function, two data members (JobID and Status), and the Status (get/set) 
function associated with the Status data member: 

    package Job;

    use strict;
    use warnings;

    # Constructor: create a Job object with its two data members.
    sub new {
        my $class = shift;
        my $self  = {};
        $self->{JobID}  = undef;
        $self->{Status} = 'NotStarted';
        bless $self, $class;
        return $self;
    }

    # Combined get/set function. Only allow one of six strings for this field.
    sub Status {
        my $self = shift;
        my $temp = shift;    # undef when called without an argument (get)

        if ( defined($temp) ) {    # set variable if argument passed in
            if (   $temp eq 'NotStarted' || $temp eq 'Queued'
                || $temp eq 'Running'    || $temp eq 'Stopped'
                || $temp eq 'Failed'     || $temp eq 'Done' )
            {
                $self->{Status} = $temp;
            }
            else { print "illegal job status = $temp\n"; }
        }

        return $self->{Status};    # get func. always returns variable
    }

In a large program this “same-name” model reduces the number of occurrences where 
the programmer has to reference another part of the code to ensure that names are 
correct. For those packages that need to be made into base packages (for derivation of 
mostly similar but slightly different packages), we extend the object-oriented approach 
by putting the package variables into a “closure”, thereby making these data members 
less accessible to programming users of the base class. 
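
The text does not show how the closure is built, so the following is only one possible 
sketch of the idea. The package name BaseModel and its single Name member are hypothetical. 
Here the data members live in a lexical hash that is reachable only through a blessed code 
reference, so derived packages must go through the accessor functions. 

```perl
use strict;
use warnings;

package BaseModel;    # hypothetical base package, for illustration only

# The data members live in a lexical hash that only the closure can see;
# the object itself is a blessed code reference.
sub new {
    my ( $class, %args ) = @_;
    my %data = ( Name => 'unnamed', %args );
    my $self = sub {
        my $field = shift;
        die "no such field: $field\n" unless exists $data{$field};
        $data{$field} = shift if @_;    # set when a value is passed in
        return $data{$field};           # always return the current value
    };
    return bless $self, $class;
}

# Same-name get/set function that goes through the closure.
sub Name { my $self = shift; return $self->( 'Name', @_ ); }

package main;

my $model = BaseModel->new( Name => 'local' );
print $model->Name, "\n";
$model->Name('globus');
print $model->Name, "\n";
```

Unlike a blessed hash, such an object cannot be read with $reference->{name}; only the 
functions defined alongside the closure reach the data. 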



5 Job Models 

We describe the various job models that ILab currently supports in order of increasing 
complexity. The simplest represents an entirely local capability, i.e., all jobs making 
up the parameter study are submitted for execution on the local machine. The runs 
occur without the assistance of a scheduler, but may include a parallel job launcher such 
as “mpirun”. ILab generates for each run in the parameter study a single shell script, 
which constructs a main directory for the whole study (if one doesn’t already exist), 
and then builds its own subdirectory, uniquely named with an automatically generated 
parameterization identifier. Files required for input by the user’s executable are copied 
into the respective subdirectories. The executable is then started. Because no scheduler 
is assumed, jobs are run sequentially to avoid oversubscribing the local system. This is 
accomplished by chaining the shell scripts: the first script does its work and then submits 
the next script in the chain, etc. This chaining proceeds even if some command within a 
script fails (e.g., the user’s primary compute executable). 
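
The chaining scheme of the first job model can be sketched as a script generator. This is 
an illustration under assumed names (the function job_script, the directory /tmp/study, the 
identifiers p1_1 to p1_3, and the command ./solver are all hypothetical), not the scripts 
ILab actually emits. 

```perl
use strict;
use warnings;

# Return the text of one job script. $next is the path of the following
# script in the chain, or undef for the last job.
sub job_script {
    my ( $study_dir, $job_id, $command, $next ) = @_;
    my @lines = (
        '#!/bin/ksh',
        "mkdir -p $study_dir/$job_id",    # main directory and own subdirectory
        "cd $study_dir/$job_id",
        "echo 'starting $job_id'",
        $command,
    );
    # The next script is started unconditionally (no "&&"), so the chain
    # proceeds even if the user's command fails.
    push @lines, $next if defined $next;
    return join( "\n", @lines ) . "\n";
}

my $study_dir = '/tmp/study';
my @ids       = ( 'p1_1', 'p1_2', 'p1_3' );
my %scripts;
for my $i ( 0 .. $#ids ) {
    my $next = $i < $#ids ? "$study_dir/run_" . $ids[ $i + 1 ] . '.ksh' : undef;
    $scripts{ $ids[$i] } =
      job_script( $study_dir, $ids[$i], './solver < input.dat', $next );
}
print $scripts{p1_1};
```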

The second job model launches jobs onto a cluster of machines (which may include 
the originating machine), on which the user has accounts and an appropriate “.rhosts” 
file. Each job is implemented with a pair of shell scripts. The first remote-copies (Unix 
“rcp”) the second script to the remote machine and then executes (Unix “rsh”) it there. 
It is the second shell script that creates and organizes the directory layout on the remote 
host, and which starts the chain of computation. This job model currently makes no 
use of schedulers. We have not built in any mechanism for limiting the number of 
concurrently running jobs on any individual resource. This implies that the individual 
compute resources may become oversubscribed. We are planning to add a 
non-scheduler-based job “limiter” to this job model. 

The third job model is similar to the second, except that the presence of a scheduler 
is assumed. When the scheduler is PBS, the first shell script submits to the scheduler a 
script containing PBS directives followed by shell commands. 

The fourth job model assumes that the Globus metacomputing middleware is used 
for remote job submission and file manipulation and that a scheduler (PBS) is used for 
queuing and starting jobs. The remote script is similar to that of the third job model. 

None of the above shell scripts need to be provided by the user; they are automatically 
generated by ILab. In each of the above cases, a parallel job loader (currently MPI is 
supported) may be specified. 

Currently files are not cached on the remote systems at the time of job submission. 
We assume a production level environment requiring routing through a job scheduler. 
This implies that the third and fourth job models will be the most heavily used. The 
typical usage scenario is that a suite of jobs is submitted through a scheduler, and that the 
compute resource is shared with other users. The time for the parameter study to complete 
will often be numbered in days, not hours or minutes. This is based on experience 
at NASA research centers and on knowledge of the types of parameter studies users 
are contemplating. Such computations frequently utilize volatile scratch file systems 
when user allocations of permanent file-system space are insufficient to accommodate 
the input and output files of substantial parameter studies. Advance copying of files is 
therefore risky, since cached input may have been purged by a file scrubber by the time 
an individual job is started by the scheduler. However, we have devised a method for 
just-in-time caching. It guarantees that (1) only one process copies an input file to a 
cache, which avoids clobbering (this involves a lock file), and (2) that files previously cached, 
but subsequently deleted by a scrubber, are re-cached on demand by the client job. We 
will add these capabilities to the third and fourth job models. 
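
The details of the method are not given here, so the following is only a sketch of one way 
the two guarantees could be met with a lock file. The function name ensure_cached, the 
“.lock” suffix, the retry limit, and the file names are all assumptions made for this 
example. 

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Make sure $cached exists, copying it from $source if needed.
# A lock file created with O_EXCL ensures only one process copies at a time.
sub ensure_cached {
    my ( $source, $cached ) = @_;
    my $lock = "$cached.lock";
    for ( 1 .. 60 ) {    # hypothetical retry limit
        return 'hit' if -e $cached && !-e $lock;
        if ( sysopen( my $fh, $lock, O_WRONLY | O_CREAT | O_EXCL ) ) {
            close $fh;
            my $had = -e $cached;    # another job may just have finished
            my $ok  = $had || copy( $source, $cached );
            unlink $lock;
            die "copy failed: $!" unless $ok;
            return $had ? 'hit' : 'copied';
        }
        sleep 1;    # another process holds the lock; wait and retry
    }
    die "timed out waiting for $lock\n";
}

# Demonstration in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
my ( $source, $cached ) = ( "$dir/grid.in", "$dir/cache_grid.in" );
open my $out, '>', $source or die "cannot write $source: $!";
print {$out} "grid data\n";
close $out;

my @log;
push @log, ensure_cached( $source, $cached );    # first job copies the file
push @log, ensure_cached( $source, $cached );    # second job finds it cached
unlink $cached;                                  # simulate the file scrubber
push @log, ensure_cached( $source, $cached );    # re-cached on demand
print "@log\n";
```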

Utilizing shell scripts has several advantages. Unix shell languages are the “lingua 
franca” of Unix JCL. Our choice is the Korn shell [10], a highly expressive language 
for constructing sequences of commands, and for error-trapping them. In the Korn shell, 
background processes and “co-processes” (background processes that can communicate 
with the parent process) are easily created. Processes may also be easily monitored, 
and killed if necessary. Another advantage of using shell scripts is that they may be 
invoked independently of the GUI. There is no requirement that only the ILab GUI 
start user processes. Users may modify the shell scripts for their own purposes; they 
are “recyclable.” The commands in the scripts are interleaved with output statements, 
which leave a record of their workings that acts as a log. 

ILab may, in part, be described as a GUI that collects information on the locations 
of the user’s executable and input files, and assembles shell scripts for running this 
executable. It is fairly easy to change existing job models or add new ones, which is 
accomplished simply by modifying the script generating code within ILab. In order 
to simplify the addition of future job models to ILab, we used the object inheritance 
capabilities of Perl to create a base JobModel package and several derived packages 
(LocalJobModel, GlobusJobModel, etc.). New derived job models, e.g. for the 
metacomputing environments Condor and Legion, can be easily inserted into the existing 
framework. 
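
A base package with derived job models might look roughly like the sketch below. Only the 
package names JobModel and LocalJobModel come from the text; PBSJobModel, the functions 
run_command and script_body, the data members, and the generated lines are assumptions made 
for illustration. 

```perl
use strict;
use warnings;

package JobModel;    # base package: behaviour shared by all job models

sub new {
    my ( $class, %args ) = @_;
    my $self = {
        Executable => $args{Executable},
        Processors => $args{Processors} || 1,
    };
    return bless $self, $class;
}

# Shell command that starts the user's program, through mpirun if parallel.
sub run_command {
    my $self = shift;
    return $self->{Processors} > 1
      ? "mpirun -np $self->{Processors} $self->{Executable}"
      : $self->{Executable};
}

# Derived packages override this to wrap the command for their environment.
sub script_body { my $self = shift; return $self->run_command . "\n"; }

package LocalJobModel;
our @ISA = ('JobModel');    # inherits script_body unchanged

package PBSJobModel;        # hypothetical derived model for a PBS scheduler
our @ISA = ('JobModel');

sub script_body {
    my $self = shift;
    # Illustrative PBS directive followed by the inherited shell commands.
    return "#PBS -l ncpus=$self->{Processors}\n" . $self->SUPER::script_body();
}

package main;

for my $class (qw(LocalJobModel PBSJobModel)) {
    my $model = $class->new( Executable => './solver', Processors => 4 );
    print "--- $class ---\n", $model->script_body;
}
```

Adding a further job model then amounts to one more derived package that overrides 
script_body. 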

It is possible, and easy, to use ILab to launch single jobs (i.e. a singleton parameter 
study) into a local or remote compute environment that may require any of Globus, PBS, 
and/or MPI. Thus, ILab may be used simply as a convenient Unix JCL script generator 
for launching single jobs. This is especially beneficial when a job will be run on a remote 
system and requires the migration of input files and executable. 



6 Parameter Study Example - A Case Study 

Until recently, parameter studies of aerospace vehicle flow characteristics utilized mostly 
two-dimensional computational fluid dynamics (CFD) solvers. This was partly dictated 
by limitations in the available compute resources (CPU time and memory size). 
Recently, however, because of the increased availability of multi-processor machines with 
large memories, parameter studies based on three-dimensional CFD codes have become 
feasible. Nevertheless, the overhead for such large studies remains high. As an 
example, we chose the Overflow three-dimensional Navier-Stokes flow solver [11], which 
employs the overset grid method (overlapping curvilinear grids exchange interpolated 
boundary information at each time-step). The MPI parallel version of Overflow groups 
neighboring grids for solution onto individual processors. We used Overflow to compute 
the flow field of the X38 Crew Return Vehicle (CRV), a NASA space vehicle. Fig. 3 
depicts the X38 CRV and several of the body-fitted curvilinear grids defining its surface. 
The complete configuration consists of 13 curvilinear body-fitted grids and 115 
rectilinear off-body grids, totaling approximately 2.5 million points. Overflow uses some 
40 double-precision words of memory per grid point, which results in a total memory 
requirement of approximately 800 MB per run. 
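
The memory estimate follows from simple arithmetic, assuming the usual 8 bytes per 
double-precision word (the word size is not stated in the text): 

```latex
\[
2.5 \times 10^{6}\ \text{points}
\times 40\ \frac{\text{words}}{\text{point}}
\times 8\ \frac{\text{bytes}}{\text{word}}
= 8 \times 10^{8}\ \text{bytes} \approx 800\ \text{MB}
\]
```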

We chose to create a 16 x 12 parameter study for two significant flow variables in a 
portion of the glide regime of the X38: Mach number (normalized vehicle velocity) and 
Alpha (“angle-of-attack”). This results in a two-dimensional parametric study consisting 
of 192 runs. Each run for the X38 vehicle requires four processors. 

Using ILab involves the following steps. First, the user supplies a name and directory 
for the “experiment”, which is where the records for this study will be kept. Then, the 
local and remote machines on which the runs will occur are selected. Next, an input file 
directory, and the input file(s) to be parameterized, are specified. Input files are displayed 
in the special-purpose graphical editor, and parameterized as described in section 4.1, 
producing 192 parameterized input files. Next, the user identifies the executable name 
and location and also a directory where the run subdirectories will be rooted on each 
of the executing hosts. Options for specifying MPI, Globus, and PBS, and number of 
processors (four per run, in this case) are set. At this point, the appropriate shell scripts 
are generated and then initiated. This entire entry process takes under five minutes. 
If starting from a previous experiment, an MRU file may be selected, permitting the 
user to make appropriate modifications through the same widgets used to create a new 
experiment. For cases like that described above, it usually takes under a minute to create 
and start an entire parameter study. 

Two machines were selected for running the jobs, each supporting MPI, Globus, and 
PBS. The scripts generated by ILab conformed to our fourth job model. ILab submitted 
all jobs to PBS queues on the selected machines, and within approximately 24 hours all 
jobs completed. From the resulting solutions we constructed a plot of the coefficient of 
lift over drag (Cl/Cd) for the X38 CRV. See Fig. 4. Every point in the lattice represents 
a complete flow solution. 




Fig. 3. X38 Crew Return Vehicle with several of its computational body grids 

Fig. 4. Coefficient of lift over drag (Cl/Cd) for the X38 CRV 




7 CAD Tool Process Specification 

Currently, all information describing the user’s process is collected through a series of 
previous-and-next wizards, guiding the user through the process specification procedure. 
Though this model is acceptable for single stage parameterizations, it quickly becomes 







M. Yarrow et al. 



inadequate for specifying complex user processes. These may include several stages of 
parameterization, pre- and post-processing of data, archiving of data, resubmission and 
restarting of user programs, feedback loops to accommodate multidisciplinary optimiza- 
tion, etc. Currently, we are building a visual capability for complex process specification, 
providing an alternative to the wizard mechanism. It consists of a CAD tool for con- 
structing a data-flow diagram describing the user’s set of processes. The user creates a 
diagram by choosing individual process element icons from a palette and placing them 
in the diagram by mouse operations. Each icon represents a basic process building block, 
such as input file parameterization, moving, copying, or renaming files, running an exe- 
cutable, etc. At each node of the diagram a context-sensitive pop-up dialog queries the 
user for the necessary details. Internally, a directed graph representing the entire set of 
processes in the Experiment is created, e.g., a parameterization, followed by the execu- 
tion of a simulation program, followed by the execution of post-processing program(s) 
and archiving. This graph is interpreted, and the required individual shell scripts are 
constructed from the information stored at each node. The construction process consists 
of assembling the required shell scripts from macros, small groups of ILab-provided 
shell commands that perform the requested operation. 
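
The interpretation of such a process graph can be illustrated with a minimal Python sketch. It is not ILab code (ILab is written in Perl/Tk): the macro names, the command templates, and the function build_script are hypothetical and chosen only to show how a directed graph might be walked in dependency order while one macro is expanded per node.

```python
from graphlib import TopologicalSorter

# Hypothetical macros: each maps a process-element type to a shell-command template.
MACROS = {
    "parameterize": "sed 's/@VALUE@/{value}/' {template} > {output}",
    "run": "./{program} < {input} > {output}",
    "archive": "tar -cf {output} {input}",
}


def build_script(nodes, edges):
    """Assemble one shell script from a process graph.

    nodes: dict mapping node name -> (macro name, dict of macro arguments)
    edges: iterable of (upstream, downstream) node-name pairs
    """
    sorter = TopologicalSorter({name: set() for name in nodes})
    for upstream, downstream in edges:
        sorter.add(downstream, upstream)  # downstream must run after upstream
    lines = ["#!/bin/sh"]
    for name in sorter.static_order():  # dependency order
        macro, args = nodes[name]
        lines.append(MACROS[macro].format(**args))
    return "\n".join(lines)
```

A chain of parameterization, simulation, and archiving nodes would then yield a script with the three expanded commands in that order.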

Fig. 5 depicts an example of a multi-stage parameterization process. In the first stage, input to
a grid generation program (Gridder) is parameterized, resulting in three input files. After running
the grid generator, three grid systems have been created. These grid systems will be part of the
input to a flow solver (Solver), as will a flow variables input file. It is this flow input file which
is subjected to the second stage of parameterization. In the figure, a four-way parameterization
has been applied. Each of these four flow input files must be replicated three times to be paired
with the three grid files, and each of the grid files must be replicated four times for pairing with
the flow input. The result is essentially a two-dimensional parameter study (3x4), but it
has resulted from two independent stages of parameterization. This adds a higher degree
of complexity to the user’s process, and consequently, to the mechanisms required for
assembling and running these jobs. It is, in part, for this reason that we are constructing
a more powerful user interface mechanism for specifying and creating parameterization
processes.
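
The pairing of the two stages amounts to a Cartesian product of the grid files and the flow input files. The following Python sketch shows this under assumed, purely illustrative file names; it is not taken from ILab itself.

```python
from itertools import product


def pair_inputs(grid_files, flow_files):
    """Pair every grid file with every flow input file (one job per pair)."""
    return list(product(grid_files, flow_files))


grids = ["grid_0", "grid_1", "grid_2"]            # stage 1: three grid systems
flows = ["flow_0", "flow_1", "flow_2", "flow_3"]  # stage 2: four flow inputs
jobs = pair_inputs(grids, flows)                  # 3 x 4 = 12 (grid, flow) pairs
```

Each grid file appears in four pairs and each flow input file in three, which corresponds to the replication counts described above.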







Fig. 5. Multi-stage parameterization process

Summary

The needs of our user community have triggered the development of ILab, a flexible parameter
study creation and job submission tool. This modern GUI implements a modular
experimental workbench for programming research into local and remote job submission
methods, complex user process specification technology, and for experimentation with
IPG middleware. Our choice of Perl/Tk as a rapid-prototyping development language
strongly facilitates experimentation and anticipated further expansion of the core user
GUI capabilities. We have proven our ILab product with significant parameter study
computations in a distributed environment. We are currently working closely with users
whose parameter study requirements are demanding. We are adding these new capabilities
to ILab using an advanced CAD-based user-interface technology.

Abstract. A Grid system is essentially an infrastructure that allows
location-independent access to the resources and services that are provided by
geographically distributed machines and networks. One of the fundamental 
operations needed to support location-independent computing is resource 
discovery. Generally, resource discovery schemes maintain and query a 
resource status database. Dissemination of the resource status information is 
one of the key operations required to keep the resource status databases 
consistent. This paper examines several approaches for resource status 
dissemination. A new concept called the Grid potential is introduced in this 
paper. This concept is used to control the extent of data dissemination in Grid 
systems. 



1 Introduction 

The deployment of faster networking infrastructures and the availability of powerful 
microprocessors have positioned network computing as a cost-effective alternative to 
the traditional computing approaches. The Grid is defined as a generalized, large-scale
network computing system that is formed by aggregating the services provided by
several distributed resources [2, 6]. A Grid can potentially provide pervasive,
dependable, consistent, and cost-effective access to the diverse services provided by 
the distributed resources and support problem solving environments that may be 
constructed using such resources. 

One of the key motivations for constructing Grids is to provide application-level 
connectivity among the various machines so that resources and services supported by 
the individual systems can be shared in a global fashion. To enable such sharing, it is
necessary for the Grid architecture to support several services [2, 7], and resource
discovery is one of them.

In a Grid system, the resource discovery service may operate in conjunction with 
the resource management service. When a client requests service, along with the 
request it presents a set of attributes that should be satisfied by a candidate resource. 
The resource discovery process may be responsible for generating a set of best 
possible candidates for the given set of attributes. The scheduling heuristics that are 
part of the resource management mechanism may allocate the best resource(s) from 
the set based on some criterion. For example, the resource management may
solicit bids from the potential candidates and select the resource with the highest bid
to serve the request. Along with other services, resource discovery is necessary to
support resources going off-line and coming on-line. Further, the cascaded operation
of resource discovery followed by resource allocation can be efficient in a
heterogeneous dynamic system such as the Grid.

Generally, resource discovery services use “status” databases that are maintained 
by network-wide information services to fulfill the client requests. For scalable 
implementations, it is essential to organize the status databases in a distributed 
fashion. With a distributed organization for the status databases, the queries can be 
executed very efficiently but the updates to the databases may be costly. Most of the 
update costs are caused by the communication operations performed to disseminate 
status information across the Grid. This paper focuses on approaches for reducing the 
data dissemination overhead. 

In this paper, we introduce a concept called the “Grid potential” that encapsulates 
the relative processing capabilities of the different machines and networks that 
constitute the Grid. We show how the Grid potential can be used to adaptively control 
the extent of data dissemination in a Grid. 

Section 2 proposes the idea of Grid potential, which is used to adaptively control
the data dissemination overhead. Section 3 discusses the data dissemination
approaches for the resource discovery operation in the Grid context. Some results from
simulation studies that compare the different approaches to data dissemination for
resource discovery are also presented in that section. Section 4 examines the related work
in the research literature.



2 Grid Potential 

The Grid potential concept is similar to the time-to-live idea used in the Internet [5]. 
Informally, the Grid potential at a point in the Grid can be considered as the 
computing power that can be delivered to an application at that point on the Grid. The 
computing power that can be delivered to an application depends on the machines that 
are present in the vicinity and the networks that are used to interconnect them. 
Consequently, a high-performance machine when connected to the Grid will induce a 
large Grid potential. This potential, however, will decay as the launch point of the 
application moves away from the point at which the machine is connected to the Grid. 
The rate of potential decay depends on the network link capacities. The rest of this 
section presents a formal definition of the Grid potential idea. 

A node in the Grid has several attributes that can be categorized as rate-based
attributes and non-rate-based attributes. Examples of rate-based attributes include
CPU speed, FLOP rating, sustained memory access rate, and sustained disk access 
rate. A node in a Grid can be characterized by a vector where each element of the 
vector is an attribute-value pair. 

The Grid potential is based on the computing power or operating rate of a node.
Therefore, to characterize a node for deriving the Grid potential, only rate-based
attributes are considered. Let $X = \{x_0 = a_0, x_1 = a_1, \ldots, x_{N-1} = a_{N-1}\}$, where $x_i$
is a rate-based attribute of the system and $a_i$ its value at a given time. Let $F$ be a set
of functions $\{f_0, f_1, \ldots, f_{K-1}\}$, where $f_i$ operates on the set $X$ to return a scalar
value $A_i = f_i(x_0, x_1, \ldots, x_{N-1})$. Depending on the system, different functions may be
defined for it. The functions essentially form weighted sums of the attributes that can
be interpreted as different types of potentials. For example, the function
$A_c = f_c(x_0, x_1, \ldots, x_{N-1})$ may be interpreted as the compute potential of the system
and another function $A_s = f_s(x_0, x_1, \ldots, x_{N-1})$ may be interpreted as the secondary
storage potential. While the compute potential may be based on attributes that
relate to the processing rate of the node, the storage potential may be based on
attributes that relate to the performance of the storage subsystem. Further, we could
have functions that compute application-specific potentials that could be useful if the
Grid is used exclusively for particular sets of applications.

While the above functions characterize the different Grid potentials of a node in
terms of its operating rates, they are not sufficient to measure the different potentials.
Therefore, a suite of corresponding “benchmarking” programs is introduced to
measure the different potentials.

Let $\Gamma_i$ be a suite of benchmark programs meant to measure the potential that
corresponds to function $f_i$. In the benchmark suite $\Gamma_i = \{\tau_0, \ldots, \tau_{N-1}\}$, $\tau_j$ is a
program specifically designed to evaluate attribute $x_j$ of the node. Designing such
programs is feasible because only rate-based attributes are considered for computing
the potentials of a node. For example, one of the benchmarking programs might
measure the rate at which arithmetic operations are executed.

Definition 1: Node component potential ($p_j^c$) with respect to attribute $x_j$ is defined
as the number of operations performed by the node in one second as measured by the
benchmarking program $\tau_j$.

The performance of a node with respect to an application depends on the rate at 
which the basic operations required by the application can be performed by the node, 
i.e., the ultimate node performance depends on a weighted average of the individual 
node component potentials. 

Definition 2: Weighted node potential ($p^w$) is defined as a weighted average of the
node component potentials $\{p_0^c, p_1^c, \ldots, p_{N-1}^c\}$, i.e.,

$$p^w = \omega_0 p_0^c + \omega_1 p_1^c + \cdots + \omega_{N-1} p_{N-1}^c$$
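
As a minimal illustration of Definition 2, the weighted node potential can be computed as a weighted sum of the component potentials. The function name and the example values mentioned afterwards are assumptions made for this sketch and are not taken from the paper.

```python
def weighted_node_potential(component_potentials, weights):
    """Return the sum over j of weights[j] * component_potentials[j]."""
    if len(component_potentials) != len(weights):
        raise ValueError("one weight is needed per component potential")
    return sum(w * p for w, p in zip(weights, component_potentials))
```

For instance, component potentials of 100 and 40 operations per second with weighting factors 0.75 and 0.25 give a weighted node potential of 85.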

The node potential as expressed by the above equation can be considered as a
function of the weighting factors and the node component potentials. The weighting
factors determine the relative importance of the different component potentials. In
addition to varying the weighting factors, the component potentials may be varied
under certain situations.

We define the potential induced by a machine $i$ at the point of its attachment to
the Grid as the local induced Grid potential, given by $p_i^L = \rho\, p_i^w$, where
$0 < \rho \le 1$. When the machine is exclusively used for Grid computations, $\rho = 1$, and
$0 < \rho < 1$ otherwise.

Definition 3: Grid potential ($p^G$) is defined as the maximum of local induced Grid
potentials. Suppose $M$ machines are attached to a given node $j$; then the Grid
potential at that node is given by

$$p_j^G = \max_{i \in [0, M-1]} p_i^L$$
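
The local induced Grid potential and Definition 3 can be sketched in a few lines of Python. The names local_induced_potential, grid_potential, and grid_share (the fraction of the machine offered to the Grid) are assumptions introduced for this illustration.

```python
def local_induced_potential(weighted_potential, grid_share):
    """Scale a machine's weighted node potential by the share it offers to the Grid."""
    if not 0.0 < grid_share <= 1.0:
        raise ValueError("grid_share must lie in the interval (0, 1]")
    return grid_share * weighted_potential


def grid_potential(machines):
    """Grid potential at a node: the maximum local induced potential.

    machines: iterable of (weighted_potential, grid_share) pairs, one per
    machine attached to the node.
    """
    return max(local_induced_potential(p, s) for p, s in machines)
```

A node with a half-shared machine of potential 100 and a dedicated machine of potential 60 would thus have a Grid potential of 60.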

The Grid potential induced at the point of attachment (node) drops off as we move
away from the node along the Grid. This potential drop is dependent on the network
characteristics. The Grid potential induced by a machine at a node other than its point
of attachment to the Grid is defined as the remote induced Grid potential. Consider a
machine that is attached to the Grid at node $i$. Let $p_{ij}^R$ denote the remote induced
Grid potential of this machine at node $j$. The remote induced Grid potential $p_{ij}^R$ can
be considered as the effective processing power of the machine at node $j$.



3 Data Dissemination for Resource Discovery 



3.1 Overview 

Maintaining the consistency of the distributed status databases involves disseminating 
the status information. Based on the extent of message propagation, we can classify 
the data dissemination schemes into three groups. 

Universal awareness: This class of data dissemination algorithms distributes the 
status information such that a node can learn about every other node in the Grid. For 
large network sizes, the approaches in this group cause a significant amount of
communication due to the large number of message transfers.

Neighborhood awareness: The dissemination algorithms in this group propagate 
status information such that a node learns about the other nodes that are less than a 
fixed distance away from it. Although the approaches in this class limit the
dissemination overhead and are scalable to very large network sizes, other components
of the resource discovery mechanism should be able to handle the incomplete 
information in the status databases that are associated with the different nodes. 

Distinctive awareness: Because the Grid is a highly heterogeneous system, various
nodes on the Grid have different attributes. The nodes with distinct attributes are more 
significant. The extent of a node’s status information propagation is controlled by the 
significance of the node. If all nodes are homogeneous, an algorithm in this group 
reduces to an algorithm in the neighborhood awareness group. In a highly 
heterogeneous Grid, an algorithm in this group should deliver a resource discovery 
efficiency close to a universal awareness type algorithm while having a 
communication complexity closer to the neighborhood awareness algorithm. One way 
of implementing distinctive awareness is to use the Grid potential idea presented in 
the previous section. 



3.2 Data Dissemination Algorithms 

Figure 1 presents the pseudo-code for the dissemination algorithm that executes on
each node. This particular algorithm uses the flooding approach for dissemination.
Once a message comes into the node it is validated. The validation process 
implements the different types of dissemination: universal awareness, neighborhood 
awareness, and distinctive awareness. In universal awareness, the validation process 
permits all incoming messages. In the neighborhood awareness, it checks the distance 
from the source to the current node and discards the message if it exceeds the 
predefined limit. 

while (true) {
    // process incoming message
    receive message (X) {
        // validate the incoming message: this may depend on the local policy
        // if universal awareness, this function is always true
        // if neighborhood awareness, returns true only
        //     if the distance to source is less than m
        // if distinctive awareness, returns true only if the local Grid potential
        //     is less than or equal to the remote induced local Grid potential
        if (validate(X)) {
            // update the data structures that keep awareness information in the node
            process(X)
        }
        // if there are no incoming messages then break out of the loop to send messages
    } or timeout (n)

    if (currentTime > lastSentTime + n) {
        lastSentTime = currentTime
        // send to logical neighbors
        get the list of neighboring nodes Y
        foreach node in Y
            send status update message
    }
}



Figure 1: Pseudo-code for flooding based data dissemination. 
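
The validation step of Figure 1 can be sketched in Python as follows. The StatusMessage fields, the policy strings, and the parameter names are assumptions made for this illustration. The way the remote induced potential is obtained is not specified here, so it is simply carried in the message.

```python
from dataclasses import dataclass


@dataclass
class StatusMessage:
    source: str                      # node whose status is being advertised
    hops_from_source: int            # distance from the source to this node
    remote_induced_potential: float  # potential the source induces at this node


def validate(msg, policy, max_distance=None, local_grid_potential=None):
    """Return True if the message should be processed, False if it is discarded."""
    if policy == "universal":
        return True  # every incoming message is accepted
    if policy == "neighborhood":
        return msg.hops_from_source < max_distance
    if policy == "distinctive":
        # accept only sources that induce at least the local Grid potential
        return local_grid_potential <= msg.remote_induced_potential
    raise ValueError(f"unknown policy: {policy}")
```

Under the distinctive policy, a message whose source induces a potential of 50 at a node with Grid potential 60 would be discarded, which is the behavior that leads to the masking problem discussed next.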

Distinctive awareness is implemented by the validation routine, which discards the
message if the remote induced Grid potential at the local node is less than the Grid 
potential at the node. It should be noted that the Grid potential at the local node is the 
maximum of all local induced potentials. Therefore, the messages arriving from 
remote nodes that induce less remote potential at the local node than its own potential 
will be discarded. This creates a “masking problem” for nodes “behind” powerful 
nodes in a network. For example, if a network of nodes is connected to the rest of the 
network via a powerful node (as explained in earlier sections, we model the Grid as a 
connected graph with nodes representing machines), the powerful node will drop all 
incoming data dissemination messages. Thus, the powerful node will block the 
dissemination of the status information of the “interior nodes.” This masking problem
arises when a flooding-based algorithm is used for data dissemination. A swamping-based
algorithm that increases the neighborhood set as it discovers new nodes will be
able to overcome this problem.

To reduce the high message overhead of the swamping approach, it is possible to 
use a random node-based approach such as the Name-dropper algorithm [3]. Using 
the random node-based approach instead of the flooding approach avoids the masking 
problem. Consider the example situation where a powerful node connects a network 
of less powerful nodes to the rest of the network. As part of its update messages,
each node will advertise its immediate neighbors to the other nodes. Therefore, the
nodes behind the powerful node will be reachable.



3.3 Experimental Evaluation of the Algorithms 

To evaluate the performance of the various data dissemination schemes we devised 
the following simulation study. In this simulation study a computational Grid is 
modeled by a random graph with the nodes denoting the machines. The data 
dissemination scheme is responsible for updating the status database that is 
maintained at each node. Depending on the scheme that is under consideration, we 
might have a complete database at each node or an incomplete database at each node. 
We define data dissemination efficiency to be 100% if the particular data
dissemination algorithm creates a local database that is the same as an ideal global
database. The higher the value of this parameter, the more accurately the local
database captures the actual global status picture.

In the simulations, we “estimate” the above parameter by scheduling a stream of
jobs onto the Grid using an ideal global database and a local database. We use the same
scheduling algorithm in both situations, and the differences in the decisions taken
give a measure of the difference between the two databases. In addition to the above
parameter, we also report another performance measure, the schedule deviation.
This parameter is, however, more dependent on the scheduling algorithm than the
above parameter, i.e., it depends on how strongly the decisions taken by the scheduling
algorithm depend on the completeness of the status information.
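
One possible way to turn this comparison into a number is sketched below in Python. The exact formula used in the simulations is not stated in the text, so the percentage-of-agreeing-decisions measure, the function name, and the schedule(job, database) interface are all assumptions made for illustration.

```python
def estimate_dissemination_efficiency(jobs, schedule, global_db, local_db):
    """Percentage of jobs placed identically with local and global status data.

    schedule(job, database) is any deterministic scheduling function that
    returns the node chosen for the job (an assumed interface).
    """
    jobs = list(jobs)
    if not jobs:
        raise ValueError("at least one job is required")
    agree = sum(schedule(job, global_db) == schedule(job, local_db) for job in jobs)
    return 100.0 * agree / len(jobs)
```

With this measure, a local database identical to the global one yields 100%, and every decision that changes because of missing status information lowers the value.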

Figure 2: Variation of message complexity with network size.

Figure 3: Variation of dissemination efficiency with network size.

Figure 4: Variation of the schedule deviation with network size.



Figure 2 shows the variation of the message complexity with network size for the 
different data dissemination schemes. Figure 3 shows the variation of the efficiency of 
data dissemination with network size and Figure 4 shows the variation of the schedule 
deviation with network size. 

From the above results, it can be observed that the message complexities of the
neighborhood and distinctive approaches are about the same and much less than that of the
universal approach. This is expected because in the universal approach, each node 
sends a message to every other node in the network. 



4 Related Work 

Because resource discovery is a fundamental operation in distributed computer
systems, it has been examined in a variety of distributed systems, including mobile
computing, wireless sensor networks [4], high throughput computing [9], and naming
systems [1].

Several data dissemination algorithms based on the universal awareness scheme 
are examined in [3]. Their paper presents a new algorithm called the Name-Dropper 
that is proved to have a better communication complexity when compared with three 
other algorithms based on flooding, swamping, and random pointer jumping, 
respectively. Our study is different from [3] because we examine the trade-offs 
between various data dissemination approaches. 

Matchmaking [9] is a distributed resource management mechanism developed as 
part of the Condor [8] project for Grid systems. The matchmaking is based on the idea 
that resources providing services and clients requesting service advertise their 
characteristics and requirements using classified advertisements (classads). A 
matchmaker service that may be either centralized or distributed matches the client
requests to the appropriate resources. The matchmaking framework includes several
components of a resource discovery mechanism. 

The classad specification defines the syntax and semantic rules for specifying and
evaluating the attributes associated with the characteristics and requirements. The
advertising protocol lays down the rules for disseminating the advertisements. Our 
study differs from their work because we examine techniques for performing efficient 
data dissemination to support resource discovery. It may be possible to use the classad 
language as the specification language in the implementation of our scheme. 



5 Conclusions 

In this paper, we examine various strategies for data dissemination. We introduce a 
new class of data dissemination strategies called distinctive awareness. This class
of strategies can result in algorithms that have improved resource discovery efficiency 
with reduced communication overhead. We use a new concept called the Grid 
potential for implementing this class of algorithms. The Grid potential quantifies the 
relative processing powers of the different machines in a Grid. 

We performed simulation studies to examine the performance trade-offs of the 
different data dissemination schemes. Several aspects of the Grid potential concept
need further investigation. One of them is to use application-based measurement
strategies for the Grid potential instead of using special benchmarks as proposed in
this paper. Another would be to construct theoretical performance models for data
dissemination algorithms that belong to the distinctive awareness category.

In summary, this paper introduces a new class of data dissemination for resource 
discovery in distributed computing systems and in particular for resource discovery in 
Grid systems. A novel idea called the Grid potential is also presented. 

Abstract*. We address the problem of how many workers should be allocated
for executing a distributed application that follows the master-worker paradigm, 
and how to assign tasks to workers in order to maximize resource efficiency and 
minimize application execution time. We propose a simple but effective 
scheduling strategy that dynamically measures the execution times of tasks and 
uses this information to dynamically adjust the number of workers to achieve a
desirable efficiency while minimizing the loss of speedup. The scheduling
strategy has been implemented using an extended version of MW, a runtime 
library that allows quick and easy development of master-worker computations 
on a computational grid. We report on an initial set of experiments that we 
have conducted on a Condor pool using our extended version of MW to 
evaluate the effectiveness of the scheduling strategy. 



1. Introduction 

In recent years, Grid computing [1] has become a real alternative to traditional
supercomputing environments for developing parallel applications that harness 
massive computational resources. However, by its definition, the complexity incurred 
in building such parallel Grid-aware applications is higher than in traditional parallel 
computing environments. Users must address issues such as resource discovery, 
heterogeneity, fault tolerance and task scheduling. Thus, several high-level 
programming frameworks have been proposed to simplify the development of large 
parallel applications for Computational Grids (for instance, NetSolve [2], Nimrod/G
[3], MW [4]). 

Several programming paradigms are commonly used to develop parallel programs
on distributed clusters, for instance, Master-Worker, Single Program Multiple Data
(SPMD), Data Pipelining, Divide and Conquer, and Speculative Parallelism [5]. Of
the previously mentioned paradigms, the Master-Worker paradigm (also known as
task farming) is especially attractive because it can be easily adapted to run on a Grid
platform. The Master-Worker paradigm consists of two entities: a master and multiple
workers. The master is responsible for decomposing the problem into small tasks (and
distributing these tasks among a farm of worker processes), as well as for gathering the
partial results in order to produce the final result of the computation. The worker 
processes execute in a very simple cycle: receive a message from the master with the 
next task, process the task, and send back the result to the master. Usually, the 
communication takes place only between the master and the workers at the beginning 
and at the end of the processing of each task. This means that master-worker
applications usually exhibit weak synchronization between the master and the
workers, are not communication intensive, and can be run without
significant loss of performance in a Grid environment.

Due to these characteristics, this paradigm can respond quite well to an 
opportunistic environment like the Grid. The number of workers can be adapted 
dynamically to the number of available resources so that, if new resources appear they 
are incorporated as new workers in the application. When a resource is reclaimed by 
its owner, the task that was computed by the corresponding worker may be 
reallocated to another worker. 

In evaluating a Master-Worker application, two performance measures of
particular interest are speedup and efficiency. Speedup is defined, for each number of 
processors n, as the ratio of the execution time when executing a program on a single 
processor to the execution time when n processors are used. Ideally we would expect 
that the larger the number of workers assigned to the application the better the 
speedup achieved. Efficiency measures how well the n allocated
processors are utilized. It is defined as the ratio of the time that n processors spent doing useful
work to the time those processors would be able to do work. Efficiency will be a
value in the interval [0,1]. If efficiency remains close to 1 as processors are
added, we have linear speedup. This is the ideal case, where all the allocated workers
can be kept usefully busy.
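
The two measures can be written down directly from these definitions. The Python sketch below uses assumed function and parameter names; the per-processor time lists are an illustrative way of representing useful and allocated time.

```python
def speedup(time_one_processor, time_n_processors):
    """Ratio of single-processor execution time to n-processor execution time."""
    return time_one_processor / time_n_processors


def efficiency(useful_times, allocated_times):
    """Ratio of total useful work time to total time the processors were allocated.

    useful_times[i] and allocated_times[i] refer to the i-th processor.
    """
    total_allocated = sum(allocated_times)
    if total_allocated <= 0:
        raise ValueError("allocated time must be positive")
    return sum(useful_times) / total_allocated
```

For example, a program that takes 100 s on one processor and 25 s on four has a speedup of 4. If each of the four processors was busy for 20 s of its 25 s allocation, the efficiency is 0.8.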

In general, the performance of master-worker applications will depend on the 
temporal characteristics of the tasks as well as on the dynamic allocation and 
scheduling of processors to the application. In this work, we consider the problem of 
maximizing the speedup and the efficiency of a master-worker application through 
both the allocation of the number of processors on which it runs and the scheduling of 
tasks to workers at runtime. 

We address this goal by first proposing a generalized master-worker framework, 
which allows adaptive and reliable management and scheduling of master-worker 
applications running in a computing environment composed of opportunistic 
resources. Secondly, we propose and evaluate experimentally an adaptive scheduling 
strategy that dynamically measures application efficiency and task execution times, 
and uses this information to dynamically adjust the number of processors and to 
control the assignment of tasks to workers. 

The rest of the paper is organized as follows. Section 2 reviews related work in 
which the scheduling of master-worker applications on Grid environments was 
studied. Section 3 presents the generalized Master-Worker paradigm. Section 4
presents a definition of the scheduling problem and outlines our adaptive scheduling
strategy for master-worker applications. Section 5 describes the prototype
implementation of the scheduling strategy, and Section 6 shows some experimental
data obtained when the proposed scheduling strategy was applied to some synthetic
applications on a real grid environment. Section 7 summarizes the main results 
presented in this paper and outlines our future research directions. 



2. Related Work 

One group of studies has considered the problem of scheduling master-worker 
applications with a single set of tasks on computational grids. They include AppLeS 
[6], NetSolve [7] and Nimrod/G [3]. 

The AppLeS (Application-Level Scheduling) system focuses on the development 
of scheduling agents for parallel metacomputing applications. Each agent is written on
a case-by-case basis and each agent will perform the mapping of the user’s parallel
application [8]. To determine schedules, the agent must consider the requirements of 
the application and the predicted load and availability of the system resources at 
scheduling time. Agents use the services offered by the NWS (Network Weather 
Service) [9] to monitor the varying performance of available resources. 

NetSolve [2] is a client-agent-server system, which enables the user to solve
complex scientific problems remotely. The NetSolve agent does the scheduling by
searching for those resources that offer the best performance in a network. The
applications need to be built using one of the APIs provided by NetSolve to perform
RPC-like computations. There is an API for creating task farms [7], but it is targeted
to very simple farming applications that can be decomposed into a single bag of tasks.

Nimrod/G [3] is a resource management and scheduling system that focuses on the 
management of computations over dynamic resources scattered geographically over 
wide-area networks. It is targeted to scientific applications based on the “exploration 
of a range of parameterized scenarios” which is similar to our definition of master- 
worker applications, but our definition allows a more generalized scheme of farming 
applications. The scheduling schemes under development in Nimrod/G are based on 
the concept of computational economy developed in the previous implementation of 
Nimrod, where the system tries to complete the assigned work within a given deadline 
and cost. The deadline represents the time by which the user requires the result, and the cost
represents an abstract measure of what the user is willing to pay if the system 
completes the job within the deadline. Artificial costs are used in its current 
implementation to find sufficient resources to meet the user’s deadline. 

A second group of researchers has studied the use of parallel application 
characteristics by processor schedulers of multiprogrammed multiprocessor systems, 
typically with the goal of minimizing average response time [10, 11]. However, the 
results from these studies are not applicable in our case because they focused 
mainly on the allocation of jobs in shared-memory multiprocessors in which the 
computing resources are homogeneous and available throughout the computation. 
Moreover, most of these studies assume the availability of accurate historical 
performance data, provided to the scheduler simultaneously with the job submission. 
They also focus on overall system performance, as opposed to the performance of 
individual applications, and they only deal with the problem of processor allocation, 
without considering the problem of task scheduling within a fixed number of 
processors as we do in our strategy. 

3. A Generalized Master-Worker Paradigm 

In this work, we focus on the study of applications that follow a generalized Master- 
Worker paradigm because it is used by many scientific and engineering applications 
such as software testing, sensitivity analysis, training of neural networks and stochastic 
optimization, among others. In contrast to the simple master-worker model in which 
the master solves one single set of tasks, the generalized master-worker model can be 
used to solve problems that require the execution of several batches of tasks. Figure 
1 shows an algorithmic view of this paradigm. 



    Initialization 
    Do 
        For task = 1 to N 
            PartialResult = + Function (task)        [Worker Tasks] 
        end 
        act_on_batch_completed()                     [Master Tasks] 
    while (end condition not met). 

Fig. 1. Generalized Master-Worker algorithm 

A Master process will solve the N tasks of a given batch by looking for Worker 
processes that can run them. The Master process passes a description (input) of the 
task to each Worker process. Upon the completion of a task, the Worker passes the 
result (output) of the task back to the Master. The Master process may carry out some 
intermediate computation with the results obtained from each Worker as well as some 
final computation when all the tasks of a given batch are completed. After that, a new 
batch of tasks is assigned to the Master and this process is repeated several times until 
completion of the problem, that is, K cycles (which are later referred to as iterations). 
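
The batch loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MW implementation: a local thread pool stands in for the Worker processes, and the names run_master, make_batch, function, act_on_batch_completed and end_condition are hypothetical. 

```python
from concurrent.futures import ThreadPoolExecutor


def run_master(make_batch, function, act_on_batch_completed, end_condition, n_workers):
    """Run batches of tasks until end_condition(iteration, batch_result) holds.

    All callables are hypothetical stand-ins for application code.
    """
    iteration = 0
    batch_result = None
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while True:
            iteration += 1
            # Master: create the N task descriptions of this batch.
            tasks = make_batch(iteration, batch_result)
            # Workers: each task is executed and its result returned to the master.
            partial_results = list(pool.map(function, tasks))
            # Master: sequential computation once the whole batch is complete.
            batch_result = act_on_batch_completed(partial_results)
            if end_condition(iteration, batch_result):
                return iteration, batch_result


# Toy usage: three batches of four tasks; each task squares its input.
iterations, total = run_master(
    make_batch=lambda k, previous: [k * x for x in range(1, 5)],
    function=lambda x: x * x,
    act_on_batch_completed=sum,
    end_condition=lambda k, result: k >= 3,
    n_workers=2,
)
```

The synchronization point at the end of each batch is implicit here: the master does not start the next batch until every result of the current one has been collected. 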

The generalized Master-Worker paradigm is very easy to program. All algorithm 
control is done by one process, the Master, and having this central control point 
facilitates the collection of job statistics, a fact that is used by our scheduling 
mechanism. Furthermore, a significant number of problems can be mapped naturally 
to this paradigm. N-body simulations [12], genetic algorithms [13], Monte Carlo 
simulations [14] and materials science simulations [15] are just a few examples of 
natural computations that fit in our generalized master-worker paradigm. 



4. Challenges for Scheduling of Master-Worker Applications 

In this section, we give a more precise definition of the scheduling problem for 
master-worker applications and we introduce our scheduling policy. 

4.1. Motivations and Background 

Efficient scheduling of a master-worker application in a cluster of distributively 
owned resources should provide answers to the following questions: 

• How many workers should be allocated to the application? A simple approach 
would consist of allocating as many workers as tasks are generated by the 
application at each iteration. However, this policy will, in general, result in poor 
resource utilization because workers that are assigned short tasks may sit idle 
while other workers are still busy with long tasks. 

• How to assign tasks to the workers? When the execution time incurred by the tasks 
of a single iteration is not the same, the total time incurred in completing a batch of 
tasks strongly depends on the order in which tasks are assigned to workers. 
Theoretical works have proved that simple scheduling strategies based on list- 
scheduling can achieve good performance [16]. 
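
To illustrate how strongly the assignment order can matter, the following sketch simulates list scheduling: whenever a worker becomes idle, it takes the next task from the list. The task names and durations are invented for the example, and the function makespan is a hypothetical helper, not part of any library discussed here. 

```python
import heapq


def makespan(order, durations, n_workers):
    """Completion time of a batch when tasks are handed out in `order`."""
    free_at = [0.0] * n_workers      # time at which each worker becomes idle
    heapq.heapify(free_at)
    for task in order:
        start = heapq.heappop(free_at)                    # earliest idle worker
        heapq.heappush(free_at, start + durations[task])  # busy until task ends
    return max(free_at)


durations = {"a": 3.0, "b": 10.0, "c": 1.0, "d": 6.0}   # invented values
largest_first = sorted(durations, key=durations.get, reverse=True)
shortest_first = sorted(durations, key=durations.get)

t_largest_first = makespan(largest_first, durations, n_workers=2)    # 10.0
t_shortest_first = makespan(shortest_first, durations, n_workers=2)  # 13.0
```

With two workers, handing out the longest tasks first finishes the batch in 10 time units, whereas handing out the shortest first leaves the longest task for last and takes 13. 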

We evaluate our scheduling strategy by measuring the efficiency and the total 
execution time of the application. 

Resource efficiency (E) for n workers is defined as the ratio between the amount of 
time workers spent doing useful work and the amount of time workers were able to 
perform work. 



\[
E = \frac{\sum_{i=1}^{n} T_{work,i}}{\sum_{i=1}^{n} T_{up,i} - \sum_{i=1}^{n} T_{susp,i}}
\]

$n$: Number of workers. 

$T_{work,i}$: Amount of time that worker $i$ spent doing useful work. 

$T_{up,i}$: Time elapsed since worker $i$ is alive until it ends. 

$T_{susp,i}$: Amount of time that worker $i$ is suspended, that is, when it cannot do any 
work. 

Execution Time ($ET_n$) is defined as the time elapsed since the application begins its 
execution until it finishes, using $n$ workers. 

\[
ET_n = T_{finish,n} - T_{begin,n}
\]

$T_{finish,n}$: Time of the ending of the application when using $n$ workers. 

$T_{begin,n}$: Time of the beginning of the application when using $n$ workers. 

As in [17], we view efficiency as an indication of benefit (the higher the efficiency, 
the higher the benefit), and execution time as an indication of cost (the higher the 
execution time, the higher the cost). The implied system objective is to achieve 
efficient usage of each processor, while taking into account the cost to users. It is 
important to know, or at least to estimate, the number of processors that yields the point 
at which the ratio of efficiency to execution time is maximized. This would 
represent the desired allocation of processors to each job. 
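
As a minimal sketch of these two metrics, the following Python code computes the efficiency from per-worker timing records and picks the allocation that maximizes the ratio of efficiency to execution time. The names WorkerRecord, efficiency and best_allocation are hypothetical. The per-worker times are invented, while the (efficiency, execution time) pairs are those reported later in Table 2 for the 50% workload application. 

```python
from dataclasses import dataclass


@dataclass
class WorkerRecord:
    t_work: float   # time spent doing useful work
    t_up: float     # time from the worker becoming alive until it ends
    t_susp: float   # time suspended, i.e. unable to do any work


def efficiency(workers):
    """Useful work time divided by the time workers were able to work."""
    useful = sum(w.t_work for w in workers)
    available = sum(w.t_up for w in workers) - sum(w.t_susp for w in workers)
    return useful / available if available > 0 else 0.0


def best_allocation(measurements):
    """measurements maps n_workers -> (efficiency, execution_time)."""
    return max(measurements, key=lambda n: measurements[n][0] / measurements[n][1])


# Invented timing records for two workers.
e = efficiency([WorkerRecord(80, 100, 0), WorkerRecord(40, 100, 20)])

# (efficiency, execution time in seconds) as reported in Table 2.
table2 = {1: (1.0, 80192), 5: (0.94, 16669.5), 8: (0.80, 12351), 10: (0.65, 12365),
          15: (0.43, 13025), 20: (0.33, 12003), 25: (0.28, 12300.5), 28: (0.22, 12701)}
n_best = best_allocation(table2)
```

For the Table 2 figures the ratio peaks at 8 workers. 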

4.2. Proposed Scheduling Policy 

We have considered a group of master-worker applications with an iterative behavior. 
In these iterative parallel applications a batch of parallel tasks is executed K times 
(iterations). The completion of a given batch induces a synchronization point in the 
iteration loop, followed by the execution of a sequential body. This kind of 
application has a high degree of predictability, so it is possible to take 
advantage of it to decide both the use of the available resources and the allocation of 
tasks to workers. 

Empirical evidence has shown that the execution of each task in successive 
iterations tends to behave similarly, so that the measurements taken for a particular 
iteration are good predictors of near future behavior [15]. As a consequence, our 
current implementation of adaptive scheduling employs a heuristic-based method that 
uses historical data about the behavior of the application, together with some 
parameters that have been fixed according to results obtained by simulation. 

In particular, our adaptive scheduling strategy collects statistics dynamically about 
the average execution time of each task and uses this information to determine the 
number of processors to be allocated and the order in which tasks are assigned to 
processors. Tasks are sorted in decreasing order of their average execution time. 
Then, they are assigned to workers according to that order. At the beginning of the 
application execution, no data is available regarding the average execution time of 
tasks. Therefore, tasks are assigned randomly. We call this adaptive strategy Random 
and Average for obvious reasons. 
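
The ordering rule can be sketched as follows. This is an illustrative reading of the policy, not the code used in our implementation: the function name order_tasks and the layout of the history dictionary are hypothetical, and the choice to fall back to a random order whenever any task lacks measurements is an assumption. 

```python
import random
from statistics import mean


def order_tasks(task_ids, history, rng=random):
    """Return the order in which tasks are handed to workers.

    history maps task id -> list of execution times seen in earlier iterations.
    """
    if not all(history.get(t) for t in task_ids):
        # No measurements yet (e.g. first iteration): random order.
        shuffled = list(task_ids)
        rng.shuffle(shuffled)
        return shuffled
    # Otherwise: decreasing order of average execution time.
    return sorted(task_ids, key=lambda t: mean(history[t]), reverse=True)


first_iteration = order_tasks(["a", "b", "c"], {})
later_iteration = order_tasks(["a", "b", "c"], {"a": [2, 4], "b": [10], "c": [1, 1]})
```

In the second call the averages are 3, 10 and 1, so the resulting order is b, a, c. 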

Initially as many workers as tasks per iteration (N) are allocated for the application. 
We first ask for that maximum number of workers because getting machines in an 
opportunistic environment is time-consuming. Once we get the maximum number of 
machines at the start of an application, we release machines if needed, instead of 
getting a lower number of machines and asking for more. 

Then, at the end of each iteration, the adequate number of workers for the 
application is determined in a two-step approach. The first step quickly reduces the 
number of workers to bring it close to the optimal value. 
The second step carries out a fine correction of that number. If the application 
exhibits a regular behavior, the number of workers obtained by the first step in the 
initial iterations will not change, and only small corrections will be made by the 
second step. 

The first step determines the number of workers according to the workload 
exhibited by the application. Table 1 is an experimental table that has been obtained 
from simulation studies. In these simulations we have evaluated the performance of 
different strategies (including the Random and Average policy) to schedule tasks of 
master-worker applications. We tested the influence of several factors: the variance 
of task execution times among iterations, the degree of work balance among tasks, 
the number of iterations and the number of workers used [18]. 

Table 1 shows the number of workers needed to get efficiency greater than 80% 
and execution time less than 1.1 times the execution time when using N workers. These 
values would correspond to a situation in which resources are busy most of the time 
while the execution time is not degraded significantly. 

Table 1. Percentage of workers with respect to the number of tasks. 

Workload                                 <30%    30%   40%   50%   60%   70%   80%   90% 
%workers (largest tasks similar size)    Ntask   10%   55%   45%   40%   35%   30%   25% 
%workers (largest tasks diff. size)      60%     45%   35%   30%   25%   20%   20%   20% 


The first row contains the workload, defined as the percentage of the total work done when 
executing the largest 20% of tasks. The second and third rows contain the percentage of 
workers with respect to the number of tasks for a given workload in the cases that 
the 20% largest tasks have similar and different execution times, respectively. 

For example, if the 20% largest tasks have carried out 40% of the total work then 
the number of workers to allocate will be either N*0.55 or N*0.35. The former value 
will be used if the largest tasks are similar; otherwise the latter value is applied. 
According to our simulation results the largest tasks are considered to be similar if 
their execution time differences are not greater than 20%. 
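
A possible encoding of the Table 1 lookup is sketched below. The percentages are transcribed from Table 1 as printed. Several details are assumptions made only for this sketch and are not stated in the text: the number of "largest" tasks is 20% of the tasks rounded up, a workload that falls between two columns uses the lower column, the difference between the largest tasks is measured relative to the largest one, and the resulting number of workers is rounded to the nearest integer. All function names are hypothetical. 

```python
import math

# Columns of Table 1; 0 stands for the "<30%" column.
WORKLOAD_THRESHOLDS = [0, 30, 40, 50, 60, 70, 80, 90]
PCT_SIMILAR = [100, 10, 55, 45, 40, 35, 30, 25]    # as printed; Ntask = 100%
PCT_DIFFERENT = [60, 45, 35, 30, 25, 20, 20, 20]


def largest_tasks(avg_times):
    """Average times of the largest 20% of tasks (rounded up: assumption)."""
    k = max(1, math.ceil(0.2 * len(avg_times)))
    return sorted(avg_times, reverse=True)[:k]


def workload_percent(avg_times):
    """Share of the total work carried out by the largest 20% of tasks."""
    return 100.0 * sum(largest_tasks(avg_times)) / sum(avg_times)


def tasks_similar(avg_times):
    """True if the largest tasks differ by at most 20% (relative to the largest)."""
    top = largest_tasks(avg_times)
    return (top[0] - top[-1]) / top[0] <= 0.20


def table1_workers(n_tasks, workload, similar):
    """Number of workers suggested by Table 1 (nearest integer: assumption)."""
    column = max(i for i, t in enumerate(WORKLOAD_THRESHOLDS) if workload >= t)
    pct = (PCT_SIMILAR if similar else PCT_DIFFERENT)[column]
    return max(1, round(n_tasks * pct / 100))


similar_case = table1_workers(28, 40, similar=True)      # 28 * 0.55 = 15.4
different_case = table1_workers(28, 40, similar=False)   # 28 * 0.35 = 9.8
```

For N = 28 tasks and a 40% workload this yields 15 workers when the largest tasks are similar and 10 otherwise. 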

The fine correction step is carried out at the end of each iteration when the 
workload between iterations remains constant and the ratio between the last iteration 
execution time and the execution time with the current number of workers given by 
Table 1 is less than 1.1. This correction consists of decreasing the number of 
workers by one if efficiency is less than 0.8, and observing the effects on the execution time. 
If it gets worse, a worker is added, but never surpassing the value given by Table 1. 
The complete algorithm is shown in Figure 2. 



1. In the first iteration Nworkers = Ntasks. 

Next steps are executed at the end of each iteration i. 

2. Compute Efficiency, Execution Time, Workload and the Differences of the execution times of the 
   20% largest tasks. 

3. if (i == 2) 
       Set Nworkers = NinitWorkers according to Workload and Differences of Table 1. 
   else 
       if (Workload of iteration i != Workload of iteration i-1) 
           Set Nworkers = NinitWorkers according to Workload and Differences of Table 1 
       else 
           if (Execution Time of it. i DIV Execution Time of it. 2 (with NinitWorkers) <= 1.1) 
               if (Efficiency of iteration i < 0.8) 
                   Nworkers = Nworkers - 1 
           else 
               Nworkers = Nworkers + 1 



Fig. 2. Algorithm to determine Nworkers. 
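
Step 3 of the algorithm can be written as a small Python function. This is a sketch under stated assumptions rather than the code of our implementation: the function name and parameters are hypothetical, the workloads are assumed to be already mapped to a Table 1 column so that they can be compared for equality, the final else branch is read as belonging to the execution time test, and the cap at the Table 1 value is taken from the description of the fine correction step. 

```python
def update_workers(i, n_workers, table_workers, workload, prev_workload,
                   exec_time, ref_exec_time, efficiency):
    """Return Nworkers for the next iteration; called at the end of iteration i >= 2.

    table_workers  value given by Table 1 for the current workload (NinitWorkers)
    ref_exec_time  execution time measured with the Table 1 allocation
    """
    if i == 2 or workload != prev_workload:
        # First step: coarse allocation taken from Table 1.
        return table_workers
    if exec_time / ref_exec_time <= 1.1:
        # Fine correction: release one worker while efficiency is low.
        if efficiency < 0.8:
            return max(1, n_workers - 1)
        return n_workers
    # Execution time got worse: give one worker back, capped by Table 1.
    return min(table_workers, n_workers + 1)


after_iteration_2 = update_workers(2, 28, 8, 50, 50, 100.0, 100.0, 0.30)
after_low_efficiency = update_workers(5, 8, 8, 50, 50, 105.0, 100.0, 0.70)
after_slowdown = update_workers(6, 7, 8, 50, 50, 120.0, 100.0, 0.85)
```

In this toy trace the allocation drops from 28 to the Table 1 value of 8, is reduced to 7 while efficiency stays below 0.8, and returns to 8 once the execution time degrades by more than 10%. 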

5. Current Implementation 

To evaluate both the proposed scheduling algorithm and the technique to adjust the 
number of workers we have run experiments on a Grid environment using the MW 
library as Grid middleware. First, we will briefly review the main characteristics of 
MW and then we will summarize the extensions included to support both our 
generalized master-worker paradigm and the adaptive scheduling policy. 



5.1. Overview of MW 

MW is a runtime library that allows quick and easy development of master-worker 
computations on a computational grid [4]. It handles the communication between 
master and workers, asks for available processors and performs fault-detection. An 
application in MW has three base components: Driver, Tasks and Workers. The 
Driver is the master, who manages a set of user-defined tasks and a pool of workers. 
The Workers execute Tasks. To create a parallel application the programmer needs to 
implement some pure virtual functions for each component. 

Driver: This is a layer that sits above the program’s resource management and 
message passing mechanisms (Condor [19] and PVM [20], respectively, in the 
implementation we have used). The Driver uses Condor services for getting machines 
to execute the workers and to get information about the state of those machines. It 
creates the tasks to be executed by the workers, sends tasks to workers and receives 
the results. It handles workers joining and leaving the computation and rematches 
running tasks when workers are lost. To create the Driver, the user needs to 
implement the following pure virtual functions: 

• get_userinfo(): Processes arguments and does initial setup. 

• setup_initial_tasks(): Creates the tasks to be executed by the workers. 

• pack_worker_init_data(): Packs the initial data to be sent to the worker upon 
startup. 

• act_on_completed_task(): This is called every time a task finishes. 

Task: This is the unit of work to be done. It contains the data describing the tasks 
(inputs) and the results (outputs) computed by the worker. The programmer needs to 
implement functions for sending and receiving this data between the master and the 
worker. 

Worker: This executes the tasks sent to it by the master. The programmer needs to 
implement the following functions: 

• unpack_init_data(): Unpacks the initialization data passed in the Driver 
pack_worker_init_data() function. 

• execute_task(): Computes the results for a given task. 

5.2. Extended Version of MW 

In its original implementation, MW supported one master controlling only one set of 
tasks. Therefore we have extended the MW API to support our programming model, 
the Random and Average scheduling policy and to collect useful information to adjust 
the number of workers. 

To create the master process the user needs to implement another pure virtual 
function: global_task_setup. There are also some changes in the functionality of 
some other pure virtual functions: 

• global_task_setup(): It initializes the data structures needed to keep the task 
results the user wants to record. This is called once, before the execution of the first 
iteration. 

• setup_initial_tasks (iterationNumber): The set of tasks created depends on the 
iteration number. So, there are new tasks for each iteration, and these tasks could 
depend on values returned by the execution of previous tasks. This function is 
called before each iteration begins, and creates the tasks to be executed in the 
iterationNumber iteration. 

• get_userinfo(): The functionality of this function remains the same, but the user 
needs to call the following initialization functions there: 

- set_iteration_number (n): This is used to set the number of times tasks will be 
created and executed, that is, the number of iterations. If INFINITY is used to 
set the number of iterations, then tasks will be created and executed until an end 
condition is achieved. This condition needs to be set in the function 
end_condition(). 

- set_Ntasks (n): This is used to set the number of tasks to be executed per 
iteration. 

- set_task_retrive_mode (mode): This function allows the user to select the 
scheduling policy. It can be FIFO (GET_FROM_BEGIN), based on a user 
key (GET_FROM_KEY), random (GET_RANDOM) or random and average 
(GET_RAND_AVG). 

- printresults (iterationNumber): It allows the results of the iterationNumber 
iteration to be printed. 

In addition to the above changes, the MW Driver collects statistics about task 
execution times, workers’ state (when they are alive, working and suspended), and 
the beginning and end of each iteration. 

At the end of each iteration, the function UpdateWorkersNumber() is called to adjust 
the number of workers according to the algorithm explained in the 
previous section. 
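
The control flow implied by the extended Driver can be pictured with a toy, single-process Python analogue. It is emphatically not the MW C++ API: only the method names are borrowed from the description above, while the class names IterativeDriver and SumOfSquares, the run() method, the method signatures and the inlined worker are hypothetical and serve only to show the order in which the user-supplied functions are called. 

```python
from abc import ABC, abstractmethod


class IterativeDriver(ABC):
    """Toy single-process analogue of the extended Driver (not the MW C++ API)."""

    def __init__(self):
        self.iterations = 1
        self.n_tasks = 0

    # Initialization calls named after those described in the text.
    def set_iteration_number(self, n):
        self.iterations = n

    def set_Ntasks(self, n):
        self.n_tasks = n

    @abstractmethod
    def get_userinfo(self):
        """Process arguments and call the set_* functions."""

    @abstractmethod
    def global_task_setup(self):
        """Initialize result storage once, before the first iteration."""

    @abstractmethod
    def setup_initial_tasks(self, iteration_number):
        """Return the tasks of the given iteration."""

    @abstractmethod
    def execute_task(self, task):
        """Worker side: compute the result of one task (inlined here)."""

    @abstractmethod
    def act_on_completed_task(self, task, result):
        """Master side: called every time a task finishes."""

    def end_condition(self):
        return False

    def run(self):
        self.get_userinfo()
        self.global_task_setup()
        k = 0
        while k < self.iterations and not self.end_condition():
            k += 1
            for task in self.setup_initial_tasks(k):
                self.act_on_completed_task(task, self.execute_task(task))
        return k


class SumOfSquares(IterativeDriver):
    def get_userinfo(self):
        self.set_iteration_number(float("inf"))   # run until end_condition()
        self.set_Ntasks(4)

    def global_task_setup(self):
        self.total = 0

    def setup_initial_tasks(self, iteration_number):
        return [iteration_number * x for x in range(1, self.n_tasks + 1)]

    def execute_task(self, task):
        return task * task

    def act_on_completed_task(self, task, result):
        self.total += result

    def end_condition(self):
        return self.total > 100


driver = SumOfSquares()
completed_iterations = driver.run()
```

In this example the tasks of each iteration depend on the iteration number, and the unbounded iteration count is ended by end_condition() after the second batch. 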



6. Experimental Study in a Grid Platform 



In this section we report the preliminary set of results obtained with the aim of testing 
the effectiveness of the proposed scheduling strategy. We have executed some 
synthetic master-worker applications that could serve as representative examples of 
the generalized master-worker paradigm. We ran the applications on a grid platform 
and evaluated the ability of our scheduling strategy to dynamically adapt the 
number of workers without any a priori knowledge about the behavior of the 
applications. 

We have conducted experiments using a grid platform composed of a dedicated 
Linux cluster running Condor, and a Condor pool of workstations at the University of 
Wisconsin. The total number of available machines was around 700, although we 
restricted our experiments to machines with Linux architecture (both from the dedicated 
cluster and the Condor pool). The execution of our application was carried out using 
the grid services provided by Condor for requesting and detecting resources, 
determining information about resources and detecting faults. The application 
ran on a set of processors that did not exhibit significant 
differences in performance, so that the platform could be considered to be 
homogeneous. 

Our applications executed 28 synthetic tasks at each iteration. The number of 
iterations was fixed to 35 so that the application was running in a steady state most of 
the time. Each synthetic task performed the computation of a Fibonacci series. The 
length of the series computed by each task was randomly fixed at each iteration in 
such a way that the variation of the execution time of a given task in successive 
iterations was 30%. We carried out experiments with two synthetic applications that 
exhibited a workload distribution of 30% and 50% approximately. In the former case, 
all large tasks exhibited a similar execution time. In the latter case, the execution time 
of larger tasks exhibited significant differences. These two synthetic programs can be 
representative examples for master-worker applications with a highly balanced 
distribution of workload and medium balanced distribution of workload between 
tasks, respectively. Figure 3 shows, for instance, the average execution time and its 
standard deviation for each of the 28 tasks in the master-worker application with a 50% workload. 

Different runs on the same programs generally produced slightly different final 
execution times and efficiency results due to the changing conditions in the grid 
environment. Hence, average-case results are reported for sets of three runs. 

Tables 2 and 3 show the efficiency, the execution time (in seconds) and the 
speedup obtained by the execution of the master-worker application with 50% 
workload and 30% workload, respectively. The results obtained by our adaptive 
scheduling are shown in bold in both tables. In addition to these results, we show the 
results obtained when a fixed number of processors was used during the whole 
execution of the application. In particular, we tested a fixed number of processors of 
n=28, n=25, n=20, n=15, n=10, n=5 and n=1. In all cases tasks were executed 
in the order given by the list sorted by average execution time (as described in the 
previous section for the Random and Average policy). The execution time for n=1 
was used to compute the speedup of the other cases. It is worth pointing out that the 
number of processors allocated by our adaptive strategy was obtained basically 
through Table 1. Only in the case of 30% workload did the fine adjustment carry out 
an additional reduction of the number of processors. 
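
As a worked check of how the speedup column is derived, the short Python sketch below divides the execution time for n=1 by the execution time for each n, using the execution times reported in Table 2. The variable names are arbitrary. 

```python
# Execution times in seconds for the 50% workload application (Table 2).
exec_time = {1: 80192, 5: 16669.5, 8: 12351, 10: 12365,
             15: 13025, 20: 12003, 25: 12300.5, 28: 12701}

# Speedup for n workers = execution time with 1 worker / execution time with n.
speedup = {n: exec_time[1] / t for n, t in exec_time.items()}

for n in sorted(speedup):
    print(f"n={n:2d}  speedup={speedup[n]:.2f}")
```

Rounded to two decimals, the values for n=5, n=8 and n=20 are 4.81, 6.49 and 6.68. 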

The first results shown in Tables 2 and 3 are encouraging as they show that an 
adaptive scheduling policy like Random and Average was able, in general, to achieve 
a high efficiency in the use of resources while the speedup was not degraded 
significantly. The improvement in efficiency can be explained because our adaptive 
strategy tends to use a small number of resources with the aim of avoiding idle time in 
workers that compute short tasks. In general, the larger the number of processors, the 
larger the idle times incurred by workers in each iteration. This situation is also more 
pronounced when the workload of the application is more unevenly distributed among 
tasks. Therefore, for a given number of processors the largest loss of efficiency was 
normally obtained in the application with a 50% workload. 



[Fig. 3: plot titled "Tasks Average Execution Time"; image not reproduced] 




Table 2. Experimental results in the execution of a master-worker application with 50% 
workload using the Random and Average policy. 

#Workers     1       5         8       10      15      20      25        28 
Efficiency   1       0.94      0.80    0.65    0.43    0.33    0.28      0.22 
Exec. Time   80192   16669.5   12351   12365   13025   12003   12300.5   12701 
Speedup      1       4.81      6.49    6.49    6.16    6.68    6.52      6.31 

(The 8-worker column is the adaptive scheduling result.) 



Table 3. Experimental results in the execution of a master-worker application with 30% 
workload using the Random and Average policy. 

#Workers     1       5      10     15      18          20      25      28 
Efficiency   1       0.85   0.85   0.87    [missing]   0.72    0.59    0.55 
Exec. Time   36102   9269   4255   3027    [missing]   2710    2794    2434 
Speedup      1       3.89   8.48   11.93   [missing]   13.32   12.92   14.83 

(The 18-worker column is the adaptive scheduling result.) 



It can also be observed in both tables that the adaptive scheduling strategy obtained, 
in general, an execution time that was similar to or even better than the execution time 
obtained with a larger number of processors. This result basically reflects the 
opportunistic nature of the resources that were used in our experiments. The larger the 
number of processors allocated, the larger the number of task suspensions and 
reallocations incurred at run time. The need to terminate a task prematurely when the 
user claimed back the processor normally cancelled out the benefits in execution time 
obtained by the use of additional processors. Therefore, from our results, we conclude 
that the reduction in the number of processors allocated to an application running in 
an opportunistic environment is good not only because it improves overall efficiency, 
but also because it avoids side effects on the execution time due to suspensions and 
reallocations of tasks. 

As is perhaps to be expected, the best performance was normally obtained when 
the largest number of machines was used, although better machine efficiencies were 
obtained when a smaller number of machines was used. These results may seem 
obvious, but it should be stressed that they have been obtained from a real test-bed, 
in which resources were obtained from a total pool of 700 non-dedicated machines. In 
this test-bed our adaptive scheduler used only statistical information collected at 
runtime, and the execution of our applications had to cope with the effects of 
resource acquisition, local suspension of tasks, task resumption and dynamic 
redistribution of load. 

We carried out an additional set of experiments in order to evaluate the influence of 
the order of task assignment. Due to time constraints, this article only contains the 
results obtained when a master-worker application with 50% workload was scheduled 
using a Random policy. In this policy, when a worker becomes idle, a random task 
from the list of pending tasks is chosen and assigned to it. As can be seen when Tables 
2 and 4 are compared, the order in which tasks are assigned has a significant impact 
when a small number of workers is used. For fewer than 15 processors the Random and 
Average policy performs significantly better than the Random policy, both in 
efficiency and in execution time. When 15 or more processors are used, differences 
between the two policies were nearly negligible. This fact can be explained because 
when the Random policy has a large number of available processors, the probability 
of assigning a large task at the beginning is also large. Therefore, in these situations the 
assignments carried out by both policies are likely to follow a similar order. Only in 
the case of 20 processors was Random’s performance significantly worse than that of 
Random and Average. However, this could be explained because the tests of the 
Random policy with 20 processors suffered from many task suspensions and 
reallocations during their execution. 

Table 4. Experimental results for Random scheduling with a master-worker application with 
50% workload. 

#Workers     1       5       10      15      20      25      28 
Efficiency   1       0.80    0.56    0.40    0.34    0.26    0.26 
Exec. Time   80192   20055   14121   13273   13153   12109   12716 
Speedup      1       4.00    5.68    6.04    6.10    6.62    6.31 



7. Conclusions and Future Work 



In this paper, we have discussed the problem of scheduling master-worker 
applications on the computational grid. We have presented a framework for master- 
worker applications that allows the development of a tailored scheduling strategy. We 
have proposed a scheduling strategy that is both simple and adaptive and takes into 
account the measurements taken during the execution of the master-worker 
application. This information is usually a good predictor of near future behavior of the 
application. Our strategy tries to allocate and schedule the minimum number of 
processors that guarantees a good speedup by keeping the processors as busy as 
possible and avoiding situations in which processors sit idle waiting for work to be 
done. The strategy allocates a suitable number of processors by using the runtime 
information obtained from the application, together with the information contained in 
an empirical table that has been obtained by simulation. Later, the number of 
processors is adapted dynamically if the scheduling algorithm 
detects that the efficiency of the application can be improved without significant 
losses in performance. 

We have built our scheduling strategy using MW as Grid middleware, and we 
tested the scheduling strategy on a Grid environment made of several pools of 
machines, the resources of which were provided by Condor. The preliminary set of 
tests with synthetic applications allowed us to validate the effectiveness of our 
scheduling strategy. In general, our adaptive scheduling strategy achieved an 
efficiency in the use of processors close to 80% while the speedup of the 
application was close to the speedup achieved with the maximum number of 
processors. Moreover, we have observed that our algorithm quickly achieves a stable 
situation with a fixed number of processors. 

There are some ways in which this work can be extended. We have tested our 
strategy on a homogeneous Grid platform where the resources were relatively close 
to each other and the influence of the network latency was negligible. A first extension will adapt 
the proposed scheduling strategy to handle a heterogeneous set of resources. In order 
to carry this out, a normalizing factor should be applied to the average execution 
times used to index Table 1. Another extension will focus on the inclusion of additional 
mechanisms that can be used when the distance between resources is significant (for 
instance, by packing more than one task to a distant worker in order to compensate for 
network delays). A further extension will apply the 
scheduling strategy to applications that are not iterative or that exhibit 
different behaviors at different phases of the execution. This extension would be 
useful for applications that follow, for instance, a Divide and Conquer paradigm or a 
Speculative Parallelism paradigm. 